ABSTRACT


The present thesis examines a technique for automatically classifying
documents according to their subject categories. Experiments are described
for a data base of 1572 titles of papers published by the Journal of the
Acoustical Society of America in 1966, 1967, 1968, and 1961.

The feasibility of using latent class analysis for document
classification is tested by two experiments. The technique proposed by
F. B. Baker and W. K. Winters is found to be unsuitable for practical
application to document classification, because the matrices required
by the theory to be positive definite are in fact found not to be
positive definite. Another attempt to solve the accounting equations
that describe the latent class structure is based on an optimization
technique. This method requires an enormous amount of computation time
and still does not determine meaningful classes. It is concluded that
latent class analysis is not a useful technique for solution of the
problem of document classification.

The classification method based on attribute analysis proposed by
M. E. Maron is applied to the classification of the acoustical
literature. With use of a proposed procedure for choice of keywords from
document titles, the results appear to be very satisfactory. In
particular, Maron's assumption that keywords of a document occur in a
statistically independent manner does not appear to reduce the
effectiveness of the classification.

A modified application of attribute analysis to document
classification is proposed through maximization of correct classifications
of base documents using not more than two keywords in the computation
of joint word occurrences, but without use of approximate estimates.
The results are slightly superior to those of Maron's method.

CHAPTER I


INTRODUCTION 

1.1 General.

In  recent  years  a  number  of  investigations  and  experiments  have  been 
undertaken  in  various  aspects  of  automatic  documentation.  They  have  dealt 
with  the  structure,  analysis,  organization,  storage,  search,  and  retrieval 
of  information.  As  a  result,  the  conceptual  analysis  of  documents  has 
become  a  basic  consideration  in  document  handling. 

In  conventional  library  systems  trained  people  analyze  the  subject 
matter  of  documents  and  either  assign  index  words  to  them  or  else  classify 
them  in  accordance  with  existing  hierarchical  classification  schedules. 

At the present time the rate of growth of documentary data is sufficiently
high that many libraries face serious problems concerning the size of
storage media, the method of file organization, and the education of
skillful librarians. As a result of increases in the quantity of
information there are strong demands for the creation of services to supply
needed information that is directly, or indirectly, related to the
interests of particular researchers. However, it is very time consuming to
handle mass information manually because many research subjects are not
limited to narrow fields, but tend to spread over other related fields.

In many automatic documentation systems the storage of information
is not the main problem. It may be solved by provision of sufficient
hardware devices such as magnetic tapes, discs, drums, magnetic cards,
microfilm, and so forth. Much manual work may be eliminated by use
of mechanization. Furthermore, the use of computers allows more
sophisticated document processing such as automatic retrieval,
abstracting, indexing, and classification. However, even with use of
automation there still remain serious problems in the analysis and the
identification of content.

In  the  early  1960's  G.  Salton  and  his  group  at  Harvard  University 
designed  the  system  known  as  SMART,  Salton's  Magical  Automatic  Retrieval 
Technique  (17,  18).  It  is  a  fully  mechanized  information  system  and  is 
in operation at Harvard and Cornell Universities. The outstanding feature
of the SMART system is that it may use several hundred different
forms  of  content  analysis  in  order  to  determine  the  correct  words  that 
should  be  used  to  represent  and  search  documents.  The  techniques  include 
use  of  a  thesaurus,  statistical  word  associations,  syntactic  analysis, 
statistical  phrase  recognition,  and  hierarchical  arrangement  of  concepts. 
Implementation  of  the  SMART  system  has  helped  to  prove  the  practical 
feasibility  of  automatic  information  processing. 

According to Richardson's definition (16), "classification" is the
putting together of like things. Every entity, nature, idea, and art
may be analyzed and classified in accordance with appropriate
classification schedules. The present thesis, however, concentrates on the
classification of scientific documents that are described by natural
language such as that used in titles, abstracts, keywords, and subject
headings.

There  exist  general  classification  schemes  such  as  the  Universal 
Decimal  Classification  (UDC)  (22),  the  Dewey  Decimal  Classification  (DC) 
(4),  the  Library  of  Congress  Classification  (LC)  (9),  and  the  Colon 
Classification (CC) (14). They are not satisfactory enough for
classification of highly specialized subjects because they do not sufficiently
represent the details of a complex subject, and they do not provide
sufficient  flexibility  in  classification  of  documents  that  relate  to 
several  fields.  In  order  to  overcome  these  disadvantages,  "Faceted 
Classification"  was  developed  by  Vickery  (23),  and  "Analytico-Synthetic 
Classification"  by  Ranganathan  (15).  For  these  classifications  the  main 
facets in each subject field must be generated. For example, in the
subject field of Food Technology there may be four facets, Products, Parts,
Materials, and Operations, and these main facets may be further divided
into sub-facets and sub-sub-facets, and so forth. Obviously these
techniques make it possible to analyze the document concepts in greater depth.

When  adapted  to  automated  systems  the  existing  general  classification 
schemes  referred  to  above  require  considerable  help  from  human  beings 
since  the  conceptual  analysis  of  documents  is  performed  manually.  One  of 
the  aims  of  research  in  the  field  of  document  classification  is  to  clearly 
understand  the  relationships  between  document  content  and  assigned  subject 
categories. With such an understanding it is hoped that subject
categories may be assigned automatically by computer examination of the document
content. 

The  extent  to  which  subject  categories  may  be  chosen  by  automatic 
examination of document content is the subject of the present thesis.
Attention  is  confined  to  examination  of  titles  only.  Comparison  is  made 
with  the  results  of  manual  classification  based  solely  on  examination  of 
titles.  Accordingly,  the  aim  of  the  present  investigation  is  to  compare 
and  evaluate  several  methods  of  automatic  classification,  to  modify  them 
if necessary, and to compare their effectiveness with that of manual

measure of the degree of correlation of words in terms of their
frequencies of occurrence, and he attempted to formulate the means to
calculate it automatically in terms of the association factor.

1.3 Latent Class Analysis.

Latent  class  analysis  was  first  introduced  by  Lazarsfeld  (8)  for 
application  in  the  field  of  social  psychology  in  order  to  analyze  a  set  of 
questionnaires  to  assess  the  attitude  of  army  personnel  in  terms  of  various 
factors. The analysis rests on a mathematical model which assumes that
a set of data described by statistics may be divided into small groups
such that, within each group, the incidences of different words are
statistically independent.

In that statistical independence of incidences is assumed within
any group, latent class analysis and attribute analysis are essentially
the same. However, there is considerable difference in the procedure
used to derive the estimates of the necessary probabilities. In attribute
analysis the probabilities which are used to predict the attribute of a
whole are derived from a pre-existing, relatively small amount of data which
has already been classified. On the other hand, in latent class analysis,
the probabilities are generated directly from the attributes. The
advantage of latent class analysis is that the classification groups
may be derived automatically from the generation of the latent class
structure, whereas in attribute analysis the groups depend on a previously
chosen set of categories.

In 1954, T. W. Anderson (1) proposed a method for the numerical
solution of certain equations that involve probabilities and which arise in
construction of the latent class model. The Anderson technique was
developed to overcome the inherent difficulty of the method suggested
earlier by B. Green (6), in which the values of the elements of certain
required matrices cannot be defined precisely, and hence must be
approximated. Anderson formed square matrices of elements that represent
correlation probabilities of keywords, and he applied eigenvalue
techniques. However, he did not note that asymmetric matrices do not
necessarily have real eigenvalues.

In  1962,  F.  B.  Baker  (2,  3)  first  realized  that  the  latent  class 
structure  may  be  directly  applied  to  the  field  of  document  classification 
and,  in  fact,  could  be  used  to  provide  the  necessary  mathematical 
foundation  for  a  method  of  automatic  classification. 

The  difficulties  that  arise  through  introduction  of  asymmetric 
matrices  may  be  overcome  by  use  of  the  latent  class  formulation  proposed 
by  Winters  (24).  It  is  a  modification  of  Anderson's  technique,  and  leads 
to  generation  of  symmetric  matrices  and  hence  real  eigenvalues.  The 
elements  of  Winters'  matrices  represent  probabilities  of  occurrences  of 
single keywords, double keywords, and triple keywords. Use of
combinations of more keywords may construct a firmer latent class model,
but the probabilities of such combinations become small or zero, and may
be neglected in practice in the construction of latent classes.

In application of the method of Winters, eigenvalues are required to
describe probabilities. Winters did not discuss the conditions required
to ensure that the eigenvalues lie between 0 and 1; yet this condition
is essential if the eigenvalues are to represent probabilities.

1.4 Statement of the Approach of Subsequent Chapters.

The purpose of the present thesis is to critically examine and, if
necessary, develop the methods of statistical analysis for automatic
classification in terms of association of keywords and subject categories.

Chapter  II  contains  a  discussion  of  latent  class  analysis,  with 
emphasis  on  consideration  of  the  practicality  of  the  method  of  Winters  in 
so far as the required numerical computations are concerned. An
experimental attempt to apply latent class analysis to an existing document data
base  is  described  in  Chapter  III.  It  is  demonstrated  that,  contrary  to 
the  hopes  of  Baker,  the  method  of  latent  class  analysis  is  not  suitable  for 
automatic  determination  of  document  categories.  Chapter  IV  contains  a 
discussion  of  attribute  analysis  and  the  experimental  results  obtained  by 
Maron. 

The experimental results described in the present thesis were obtained
by use of a data base that contains references to journal articles in
the  field  of  acoustics.  The  data  base  and  its  subject  categories  are 
described  in  Chapter  V. 

Application of Maron's method of attribute analysis is made in Chapter
VI. Although the method is not new, it is believed that the results are of
value in providing assessment of Maron's method, since the data base is
much larger than that used by Maron, and therefore it provides a more
realistic example of a document data base. Furthermore, the categories
used by Maron were the result of his modification of an existing
classification scheme, whereas the categories used in the present experiment
are ones that have been in use since 1961. It is therefore believed that
the experimental results provide a useful measure of the effectiveness of
Maron's method in comparison with a well established and accepted method
of manual assignment of categories.

The classification obtained in Chapter VI is based on use of keywords
chosen from document titles, whereas the results of Maron were based on
use  of  keywords  selected  from  document  abstracts.  The  results  of 
Chapter  VI  indicate  that  the  method  of  choice  of  keywords  from  titles 
leads  to  automatic  classification  that  is  as  good  as  that  obtained  by 
Maron  when  using  keywords  chosen  from  abstracts. 

Chapter VII introduces some modifications of attribute analysis. The
results are compared with those of Chapter VI.

CHAPTER II

LATENT  CLASS  ANALYSIS 


2.1 General.

In application of latent class analysis to automatic document
classification systems it is supposed that, within an entire corpus of
documents, there exists a set of non-intersecting classes in which the
occurrence of each keyword in a document is statistically independent of the
occurrences of other keywords. The latent class analysis then proceeds
through use of probabilities that describe associations between latent
classes and certain combinations of keywords. The associations are
formulated as probabilities that a document with a particular
combination of keywords belongs to any one of the latent classes.

F. B. Baker (2) first attempted to apply techniques employed by
Lazarsfeld (8). He proposed the mathematical model of latent class
analysis, and suggested how to use it.

W. K. Winters (24) modified Baker's latent class structure and
discussed the numerical procedures required.

Use of a large number of keywords allows the numerical methods to
determine close approximations to the latent classes, but because of the
complexity of the computations the number of considered combinations of
keywords must be limited. Furthermore, the most difficult problem involved
in the construction of latent classes is determination of the number of
classes to be sought. It seems that there is no firm theory to determine
this number. Baker proposed the inequality (N + 1) / 2 > L between the
number of keywords and the number of classes, where N denotes the number
of keywords and L denotes the number of classes. However, the manner in
which he obtained this inequality is not explained. In order to make the
numerical solution feasible in practice, Winters assumed that the number
of latent classes is equal to the number of existing keywords, that is,
L = N. Clearly Winters' assumption is in disagreement with the Baker
inequality, and there is a need for further investigation before either
relation between N and L may be used with any degree of confidence.
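
The disagreement can be checked with one line of algebra. Substituting
Winters' assumption L = N into Baker's inequality, with both relations
taken exactly as they are stated above, gives:

```latex
\[
\frac{N+1}{2} > L \quad\text{and}\quad L = N
\;\Longrightarrow\; N + 1 > 2N
\;\Longrightarrow\; N < 1 ,
\]
```

so the two relations cannot hold together for any non-empty set of keywords.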

2.2  Latent  Class  Structure 

The latent class analysis uses the following probabilities for keyword
occurrences in the entire set of documents:

$p_i$ = probability that a document contains the keyword $K_i$,

$p_{ij}$ = probability that a document contains both keywords $K_i$ and $K_j$,

$p_{ijk}$ = probability that a document contains three keywords $K_i$,
$K_j$, and $K_k$.

For a document that belongs to the latent class $C_\ell$, the probabilities
$h_i^\ell$, $h_{ij}^\ell$, and $h_{ijk}^\ell$ are defined as follows:

$h_i^\ell$ = probability that the document contains the keyword $K_i$,

$h_{ij}^\ell$ = probability that the document contains both keywords
$K_i$ and $K_j$,

$h_{ijk}^\ell$ = probability that the document contains three keywords
$K_i$, $K_j$, and $K_k$.

For an arbitrary document chosen from the entire data the probability
$g^\ell$ is defined as follows:

$g^\ell$ = probability that the document belongs to the latent class $C_\ell$.

With N keywords and L latent classes, there are $2^N$ p's, L g's and
$L \times 2^N$ h's. The relationships between the p's, g's, and h's are
expressed by the following equations, known as the accounting equations:

$$p_i = \sum_{\ell=1}^{L} g^\ell h_i^\ell \qquad (2.1)$$

$$p_{ij} = \sum_{\ell=1}^{L} g^\ell h_{ij}^\ell \quad (\text{for } i \neq j) \qquad (2.2)$$

$$p_{ijk} = \sum_{\ell=1}^{L} g^\ell h_{ijk}^\ell \quad (\text{for } i \neq j,\ i \neq k,\ \text{and } j \neq k) \qquad (2.3)$$

etc.

There is one more equation, namely

$$1 = \sum_{\ell=1}^{L} g^\ell \qquad (2.4)$$

which  expresses  the  fact  that  a  document  belongs  to  one,  and  only  one, 
latent  class. 

Basically  the  problem  of  latent  class  analysis  is  to  find  the 
solution  for  the  g's  and  h's  to  satisfy  the  accounting  equations  (2.1), 
(2.2), (2.3), etc., and (2.4).

The defining property of a latent class is that, for all documents
within it, the keywords occur in a statistically independent manner so
that

$$h_{ij}^\ell = h_i^\ell h_j^\ell, \qquad h_{ijk}^\ell = h_i^\ell h_j^\ell h_k^\ell, \qquad \text{etc.}$$

The accounting equations may then be rewritten in the form

$$p_i = \sum_{\ell=1}^{L} g^\ell h_i^\ell \qquad (2.5)$$

$$p_{ij} = \sum_{\ell=1}^{L} g^\ell h_i^\ell h_j^\ell \quad (i \neq j) \qquad (2.6)$$

$$p_{ijk} = \sum_{\ell=1}^{L} g^\ell h_i^\ell h_j^\ell h_k^\ell \quad (i \neq j,\ i \neq k,\ \text{and } j \neq k) \qquad (2.7)$$

etc.
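
As an illustration of how equations (2.4) to (2.7) are used in the forward
direction, the following minimal sketch computes single, pairwise, and
triple keyword probabilities from a given set of g's and h's. The two-class,
three-keyword parameter values and the function name `joint_probability`
are hypothetical and were chosen only for this example.

```python
from itertools import combinations
from math import prod


def joint_probability(g, h, keywords):
    """Probability that a document contains all the given keywords.

    g[l]    : probability that a document belongs to latent class l
    h[l][i] : probability that a class-l document contains keyword i
    Within a class the keywords are treated as independent, so the
    class-conditional joint probability is a plain product.
    """
    return sum(g_l * prod(h_l[i] for i in keywords) for g_l, h_l in zip(g, h))


# Hypothetical parameters: L = 2 latent classes, N = 3 keywords.
g = [0.6, 0.4]
h = [[0.5, 0.2, 0.1],
     [0.1, 0.3, 0.8]]

assert abs(sum(g) - 1.0) < 1e-12  # equation (2.4)

p_single = {(i,): joint_probability(g, h, (i,)) for i in range(3)}              # (2.5)
p_pair = {ij: joint_probability(g, h, ij) for ij in combinations(range(3), 2)}  # (2.6)
p_triple = joint_probability(g, h, (0, 1, 2))                                   # (2.7)
```

Latent class analysis attempts the reverse of this computation: the p's are
estimated from the documents, and the g's and h's are the unknowns.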

Once the unknown g's and h's have been estimated, the degree of
association between a document that contains a particular combination of
keywords, say $K_1, K_2, \ldots, K_M$, and the latent class $C_\ell$ may be
defined as the probability

$$p^\ell = \frac{D g^\ell h_{1,2,\ldots,M}^\ell}{\sum_{k=1}^{L} D g^k h_{1,2,\ldots,M}^k} \qquad (2.8)$$

where D is the total number of documents. Then $p^\ell$ is the probability
that a document indexed by keywords $K_1, K_2, \ldots, K_M$ belongs to the
latent class $C_\ell$.

By the independence assumption, $h_{1,2,\ldots,M}^\ell = h_1^\ell h_2^\ell \cdots h_M^\ell$, and hence

$$p^\ell = \frac{g^\ell h_1^\ell h_2^\ell \cdots h_M^\ell (1-h_{M+1}^\ell) \cdots (1-h_N^\ell)}{\sum_{k=1}^{L} g^k h_1^k h_2^k \cdots h_M^k (1-h_{M+1}^k) \cdots (1-h_N^k)} \qquad (2.9)$$

which is computable in terms of the g's and h's; the factors $(1-h)$
correspond to the keywords $K_{M+1}, \ldots, K_N$ that are absent from the
document. The latent class for which this probability assumes its maximum
value is the class that is assigned to the given document. This
probability is called the "ordering ratio".
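
The assignment rule based on the ordering ratio can be sketched as follows.
This is only an illustration of equation (2.9); the parameter values and
the name `ordering_ratios` are hypothetical, and the factor D is omitted
because it cancels between numerator and denominator.

```python
from math import prod


def ordering_ratios(g, h, present):
    """Ordering ratio of every latent class for one document, as in (2.9).

    present : set of keyword indices that occur in the document; every
              other keyword contributes a factor (1 - h).
    """
    n_keywords = len(h[0])
    weights = [
        g_l * prod(h_l[i] if i in present else 1.0 - h_l[i] for i in range(n_keywords))
        for g_l, h_l in zip(g, h)
    ]
    total = sum(weights)  # zero only if no class can produce this keyword pattern
    return [w / total for w in weights]


# Hypothetical parameters: L = 2 latent classes, N = 3 keywords.
g = [0.6, 0.4]
h = [[0.5, 0.2, 0.1],
     [0.1, 0.3, 0.8]]

ratios = ordering_ratios(g, h, present={0})  # document contains only the first keyword
assigned_class = max(range(len(ratios)), key=ratios.__getitem__)
```

A document is assigned to the class with the largest ratio; in this made-up
case the first class dominates because the second class rarely produces the
first keyword without the third.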
2.3 Numerical Solution of Winters.

Under the assumption that the number of keywords equals the number
of latent classes, Winters used a modification of T. W. Anderson's
technique to propose one possible solution in matrix notation.

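Defining five NxN (or LxL) square matrices as follows:

$$P = \begin{bmatrix}
p_N & p_{1,N} & p_{2,N} & \cdots & p_{N-1,N} \\
p_{1,N} & p_{1,1,N} & p_{1,2,N} & \cdots & p_{1,N-1,N} \\
p_{2,N} & p_{2,1,N} & p_{2,2,N} & \cdots & p_{2,N-1,N} \\
\vdots & \vdots & \vdots & & \vdots \\
p_{N-1,N} & p_{N-1,1,N} & p_{N-1,2,N} & \cdots & p_{N-1,N-1,N}
\end{bmatrix} \qquad (2.10)$$

$$\hat{P} = \begin{bmatrix}
1 & p_1 & p_2 & \cdots & p_{N-1} \\
p_1 & p_{1,1} & p_{1,2} & \cdots & p_{1,N-1} \\
p_2 & p_{2,1} & p_{2,2} & \cdots & p_{2,N-1} \\
\vdots & \vdots & \vdots & & \vdots \\
p_{N-1} & p_{N-1,1} & p_{N-1,2} & \cdots & p_{N-1,N-1}
\end{bmatrix} \qquad (2.11)$$

$$H = \begin{bmatrix}
1 & h_1^1 & h_2^1 & \cdots & h_{N-1}^1 \\
1 & h_1^2 & h_2^2 & \cdots & h_{N-1}^2 \\
1 & h_1^3 & h_2^3 & \cdots & h_{N-1}^3 \\
\vdots & \vdots & \vdots & & \vdots \\
1 & h_1^N & h_2^N & \cdots & h_{N-1}^N
\end{bmatrix} \qquad (2.12)$$

$$\hat{H} = \begin{bmatrix}
h_N^1 & 0 & 0 & \cdots & 0 \\
0 & h_N^2 & 0 & \cdots & 0 \\
0 & 0 & h_N^3 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & h_N^N
\end{bmatrix} \qquad (2.13)$$

and

$$G = \begin{bmatrix}
g^1 & 0 & 0 & \cdots & 0 \\
0 & g^2 & 0 & \cdots & 0 \\
0 & 0 & g^3 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & g^N
\end{bmatrix} \qquad (2.14)$$

where $P$ and $\hat{P}$ are symmetric, and $\hat{H}$ and $G$ are diagonal
matrices, the above accounting equations may be rewritten in the form

$$P = H'G\hat{H}H \qquad (2.15)$$

and

$$\hat{P} = H'GH \qquad (2.16)$$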

where H' denotes the transpose of H. However, these matrix forms
(2.15) and (2.16) represent combinations of up to only three keywords in the
accounting equations. Because the occurrence of any given set of more
than three keywords in a document may be relatively rare, most of the
neglected probabilities are zero or very small, so that neglecting the
combinations of more than three keywords should not have any serious
effect on the results.
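
The matrix forms can be checked entry by entry against the accounting
equations. The sketch below is a hypothetical illustration that assumes
NumPy and an invented model with L = N = 3. It builds H, the diagonal
matrix of the h's for keyword N, and G, forms the two products of (2.15)
and (2.16), and compares selected entries with the corresponding sums over
classes. `P_hat` denotes the matrix whose leading entry is 1.

```python
import numpy as np

# Hypothetical model with L = N = 3: rows are classes, columns are keywords K1..K3.
g = np.array([0.5, 0.3, 0.2])
h = np.array([[0.6, 0.1, 0.3],
              [0.2, 0.7, 0.5],
              [0.1, 0.2, 0.9]])

n = h.shape[1]
H = np.column_stack([np.ones(len(g)), h[:, : n - 1]])  # (2.12): leading column of ones
H_hat = np.diag(h[:, n - 1])                           # (2.13): h for keyword N, per class
G = np.diag(g)                                         # (2.14)

P = H.T @ G @ H_hat @ H   # (2.15): entries p_N, p_{i,N}, p_{i,j,N}
P_hat = H.T @ G @ H       # (2.16): entries 1, p_i, p_{i,j}

# The same quantities written directly as sums over the latent classes.
p_N = np.sum(g * h[:, n - 1])
p_1 = np.sum(g * h[:, 0])
p_12 = np.sum(g * h[:, 0] * h[:, 1])
p_12N = np.sum(g * h[:, 0] * h[:, 1] * h[:, n - 1])
```

Note that the diagonal entries of `P_hat` come out as sums of g times h
squared, which is why an estimate of a "repeated keyword" probability is
needed when the matrices are built from document counts.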

Winters made the assumption that the matrices H, G, and $\hat{H}$ are
non-singular. This assumption implies that all the diagonal g's and
h's must be non-zero. Using this assumption he proved that $P$ and $\hat{P}$ are
positive definite so that all eigenvalues of $P$ and $\hat{P}$ are positive.

In order to solve the system described by (2.15) and (2.16) consider
the following generalized eigenvalue problem:

$$Px = \lambda \hat{P} x \qquad (2.17)$$

where x is an eigenvector associated with the eigenvalue $\lambda$. Defining a
matrix T which satisfies the condition

$$T'\hat{P}T = I \qquad (2.18)$$

then the matrix $T'PT$ has eigenvalues equal to the solutions $\lambda$ of
equation (2.17).

The fact can be proved as follows:

Pre-multiply the equation (2.17) by T' to obtain

$$T'Px = \lambda T'\hat{P}x \qquad (2.19)$$

Let x = Ty. Substituting it in formula (2.19), we get

$$T'PTy = \lambda T'\hat{P}Ty \qquad (2.20)$$

Since $T'\hat{P}T = I$, the equation (2.20) becomes

$$T'PTy = \lambda y \qquad (2.21)$$

and hence $\lambda$ is an eigenvalue of the matrix $T'PT$.

The purpose of this numerical technique is to derive the unknown
matrices H, $\hat{H}$, and G defined in (2.12), (2.13), and (2.14) respectively.

It may be shown that the diagonal matrix $\hat{H}$ can be obtained by
solving the characteristic equation

$$|T'PT - \lambda I| = 0 \qquad (2.22)$$

The proof is as follows:

$$|T'PT - \lambda I| = |T'H'G\hat{H}HT - \lambda T'H'GHT| = |T'|\,|H'|\,|G|\,|\hat{H} - \lambda I|\,|H|\,|T| \qquad (2.23)$$

where H, G, and T are assumed to be non-singular. Therefore,

$$0 = |\hat{H} - \lambda I| \qquad (2.24)$$

Thus, since $\hat{H}$ is a diagonal matrix its elements $h_N^\ell$ are equal to the
eigenvalues $\lambda$ of equation (2.22).

We note that all eigenvectors x, which are column vectors, may be
arranged to form a square matrix X. Assume that there exists a
diagonal matrix D which satisfies the relation

$$\hat{P}X = H'GD \qquad (2.25)$$

The matrices G and D are diagonal so that GD is also a diagonal matrix,
and furthermore, the first row of H' consists of all ones. Therefore
the diagonal elements of GD must be equal to the elements on the first
row of $\hat{P}X$. By post-multiplying the formula (2.25) by $(GD)^{-1}$ we
obtain

$$\hat{P}X(GD)^{-1} = H' \qquad (2.26)$$

Substituting $\hat{P} = H'GH$, and eliminating H', the formula (2.26) can be
rewritten as follows:

$$GHX(GD)^{-1} = I \qquad (2.27)$$

so that finally G may be expressed in the form

$$HX(GD)^{-1} = G^{-1} \qquad (2.28)$$

It should be noted that $(GD)^{-1}$ is easily computed by taking the
reciprocal of each diagonal element of GD. Similarly, since G is a diagonal
matrix, it may be obtained in a trivial manner from $G^{-1}$.

The  eigenvalue  problem  defined  in  (2.22)  involves  a  symmetric 
matrix,  and  hence  is  suitable  for  attempted  solution  by  either  Jacobi's 
method,  Givens'  method,  or  Householder’s  method. 

It remains to determine T which is defined in the formula (2.18).
We use an important theorem relative to the eigenvalue problem of
symmetric matrices, namely that if a matrix A is symmetric then there
exists an orthogonal matrix Q such that

$$Q'AQ = D \qquad (2.29)$$

where D is a diagonal matrix whose diagonal elements are the
eigenvalues of A. In the formula (2.18) the matrix $\hat{P}$ is indeed symmetric so
that by applying a suitable method, the eigenvalues as diagonal elements
of D, and the eigenvectors as columns of Q, may be determined to
satisfy the equation

$$Q'\hat{P}Q = D = (d_i \delta_{ij}) \qquad (2.30)$$

Since $\hat{P}$ is positive definite, the diagonal elements of D are positive.
Therefore the matrix T can be obtained from the formula

$$T = QD^{-1/2} \qquad (2.31)$$
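
That the choice $T = QD^{-1/2}$ does satisfy the normalizing condition
$T'\hat{P}T = I$ can be verified directly, using only the orthogonal
diagonalization $Q'\hat{P}Q = D$ and the fact that D is diagonal:

```latex
\[
T'\hat{P}T
  = D^{-1/2}\,Q'\hat{P}Q\,D^{-1/2}
  = D^{-1/2}\,D\,D^{-1/2}
  = I .
\]
```

The step requires every diagonal element of D to be strictly positive; a
zero or negative eigenvalue leaves $D^{-1/2}$ undefined over the real
numbers.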


The advantage of the above numerical technique is that by derivation
of symmetric matrices it is possible to avoid the need for inversion
of a general matrix which would tend to involve a large computational
error.
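
The complete procedure of this section can be traced numerically on a small
artificial model. The following is a sketch under stated assumptions, not a
reproduction of Winters' program. It assumes NumPy, a hypothetical 3-class
model invented for the example, and distinct eigenvalues. The name
`winters_solve` is chosen here, and NumPy's `eigh` routine stands in for
the Jacobi, Givens, or Householder methods. The classes are recovered in
order of increasing eigenvalue.

```python
import numpy as np


def winters_solve(P, P_hat):
    """Recover H, H_hat, G from P = H'G H_hat H and P_hat = H'G H."""
    d, Q = np.linalg.eigh(P_hat)               # Q' P_hat Q = diag(d)
    if np.any(d <= 0):
        raise ValueError("P_hat is not positive definite; T cannot be formed")
    T = Q / np.sqrt(d)                         # T = Q D^{-1/2}, so T' P_hat T = I
    S = T.T @ P @ T
    lam, Y = np.linalg.eigh((S + S.T) / 2)     # symmetric eigenproblem for T'PT
    X = T @ Y                                  # eigenvectors of P x = lam P_hat x
    PX = P_hat @ X
    gd = PX[0, :]                              # diagonal of GD = first row of P_hat X
    H = (PX / gd).T                            # H' = P_hat X (GD)^{-1}
    g = 1.0 / np.diag((H @ X) / gd)            # G^{-1} = H X (GD)^{-1}
    return H, np.diag(lam), np.diag(g)


# Hypothetical "true" model, with the h's for keyword N in increasing order.
g_true = np.array([0.5, 0.3, 0.2])
H_true = np.array([[1.0, 0.6, 0.1],
                   [1.0, 0.2, 0.7],
                   [1.0, 0.1, 0.2]])
H_hat_true = np.diag([0.3, 0.5, 0.9])
G_true = np.diag(g_true)

P_hat = H_true.T @ G_true @ H_true
P = H_true.T @ G_true @ H_hat_true @ H_true

H, H_hat, G = winters_solve(P, P_hat)
```

With exact input matrices the original parameters are recovered to rounding
error. The explicit check on the eigenvalues of `P_hat` marks the point at
which the method stops if the estimated matrix is not positive definite.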

Before proceeding with the numerical solution of Winters, the
elements of the matrices $P$ and $\hat{P}$ must of course be estimated in terms
of the probabilities that a document contains certain combinations of
keywords as defined in Section 2.2.

2.4  Summary. 

Use  of  the  latent  class  concept,  and  the  procedure  for  numerical 
solution  of  the  equations  as  described  above,  appears  attractive  as  a 
means  of  determination  of  document  classes.  The  only  probabilities 
required  to  be  known  are  those  that  involve  word  associations  within 
documents. It is not necessary to begin with a subdivision of
documents into classes since this is determined as a result of the
numerical  solutions. 

However, the above analysis is based on the assumption that
disjoint sets of documents with the required latent class properties do,
in  fact,  exist.  It  is  also  supposed  that  such  classes,  if  they  exist, 
have  significance  to  users  of  the  document  data  base. 

If  disjoint  sets  of  documents  with  latent  class  properties  do 
not  exist  for  a  given  data  base,  the  fact  will  be  apparent  in  that  the 
above procedure will not lead to a meaningful solution of the
accounting equations. In order to be meaningful, a solution must lead to
probability values that all lie within the range of 0 to 1. This
condition is examined in the next Chapter.


CHAPTER III

APPLICATIONS  OF  LATENT  CLASS  ANALYSIS 

3.1 Application of Winters' Method Using Experimental Data.

Winters did not perform any practical experiments to verify the
applicability of his numerical solution to determine document classification.
Instead, he gave an artificial example to illustrate the mathematical
techniques when H, $\hat{H}$, and G are 4x4 matrices. The matrices $P$ and
$\hat{P}$ were computed from the relations $P = H'G\hat{H}H$ and
$\hat{P} = H'GH$. Then, by application of his numerical techniques, he
examined whether the original values resulted for H, $\hat{H}$, and G. In
fact the computed values were in agreement with those assumed. This example
only proved that his numerical techniques were valid, and that the
equations did not become ill-conditioned for his example of 4x4 matrices.
He made no attempt to find a solution of the equations that result for
matrices of higher order or for matrices derived from real document data.

We have performed one experiment in which 7006 titles from the
acoustic literature were used to compute the necessary probabilities for
$P$ and $\hat{P}$. In our experiment, the probability $p_{ii}$ was computed as the
probability that a document contains the same keyword $K_i$ twice, and also
the probability $p_{iij}$ was computed in the same manner. Details of the
acoustics data base are given in Chapter V.
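
The estimation of these probabilities from a collection of titles can be
sketched as follows. This is an illustration under stated assumptions, not
the program used in the experiment. Each title is assumed to have been
reduced already to a list of keyword indices, "contains the same keyword
twice" is read as "at least two occurrences", and the five-title collection
and the name `estimate_probabilities` are invented for the example.

```python
from collections import Counter
from itertools import combinations


def estimate_probabilities(docs, n_keywords):
    """docs: one list of keyword indices (repeats allowed) per title."""
    total = len(docs)
    single = [0] * n_keywords   # titles containing K_i
    twice = [0] * n_keywords    # titles containing K_i at least twice
    pair = Counter()            # titles containing both K_i and K_j, with i < j
    for doc in docs:
        counts = Counter(doc)
        for i, c in counts.items():
            single[i] += 1
            if c >= 2:
                twice[i] += 1
        for i, j in combinations(sorted(counts), 2):
            pair[(i, j)] += 1
    p_i = [c / total for c in single]
    p_ii = [c / total for c in twice]
    p_ij = {ij: c / total for ij, c in pair.items()}
    return p_i, p_ii, p_ij


# Hypothetical collection of five titles over three keywords.
docs = [[0, 0], [0, 2], [2], [1, 2], []]
p_i, p_ii, p_ij = estimate_probabilities(docs, 3)
```

Triple probabilities can be counted in the same way by extending the inner
loop to combinations of three keyword indices.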

The  following  six  words  were  selected  arbitrarily  as  keywords: 

1. ABSOR
2. EAR
3. NOISE
4. SPEEC
5. ULTRA
6. WATER

They are listed in truncated form to indicate that ABSOR might denote
ABSORb, ABSORption, etc., and similarly that NOISE might include NOISEs,
etc. The form of truncation is described further in Chapter V.
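
One simple way to match title words against truncated keywords is prefix
comparison, sketched below. The rule shown (case-insensitive prefix match
after stripping punctuation) is an assumption made for illustration and may
differ from the truncation procedure of Chapter V; the function name
`keyword_indices` and the sample titles are hypothetical.

```python
STEMS = ["ABSOR", "EAR", "NOISE", "SPEEC", "ULTRA", "WATER"]


def keyword_indices(title, stems=STEMS):
    """Return, for each word of the title that matches, the index of its stem."""
    found = []
    for word in title.upper().split():
        word = word.strip(".,;:()\"'")
        for index, stem in enumerate(stems):
            if word.startswith(stem):
                found.append(index)
                break  # at most one stem per word
    return found


hits = keyword_indices("Absorption of ultrasonic noise in water")
over_match = keyword_indices("Early hearing")
```

Plain prefix matching over-matches short stems: in the second call EAR
matches "Early" as well as genuinely related words, so a practical
procedure would need additional rules.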

The matrices of probabilities $P$ and $\hat{P}$ were computed to give

$$P = \begin{bmatrix}
0.032258 & 0.002855 & 0.000143 & 0.001570 & 0.0 & 0.003283 \\
0.002855 & 0.0 & 0.0 & 0.0 & 0.0 & 0.001142 \\
0.000143 & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 \\
0.001570 & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 \\
0.0 & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 \\
0.003283 & 0.001142 & 0.0 & 0.0 & 0.0 & 0.0
\end{bmatrix}$$

$$\hat{P} = \begin{bmatrix}
1.0 & 0.031402 & 0.006709 & 0.052098 & 0.021696 & 0.084499 \\
0.031402 & 0.001285 & 0.0 & 0.0 & 0.0 & 0.011704 \\
0.006709 & 0.0 & 0.0 & 0.000714 & 0.000285 & 0.0 \\
0.052098 & 0.0 & 0.000714 & 0.001570 & 0.002997 & 0.000143 \\
0.021696 & 0.0 & 0.000285 & 0.002997 & 0.000143 & 0.0 \\
0.084499 & 0.011704 & 0.0 & 0.000143 & 0.0 & 0.000571
\end{bmatrix}$$

The next step was to derive the matrix T which satisfies
$T'\hat{P}T = I$. First, in order to determine the eigenvalues and eigenvectors
of $\hat{P}$, Householder's method was applied to compute $Q'\hat{P}Q = D$ where the
diagonal elements of the diagonal matrix D are the eigenvalues of $\hat{P}$. The
corresponding eigenvectors appear as the column vectors of Q. As a result
of this computation it was found that three of the six eigenvalues were
negative. This implies that the matrix $\hat{P}$ is not positive definite. The
computed eigenvalues and eigenvectors are as follows:


$$
D = \operatorname{diag}(1.011309,\ 0.008733,\ 0.000015,\ -0.000181,\ -0.002923,\ -0.013384)
$$

$$
Q = \begin{bmatrix}
-0.994416 & -0.041449 & -0.048236 & 0.035995 & 0.040402 & 0.064534 \\
-0.031884 & 0.686161 & 0.340707 & -0.252564 & -0.279383 & 0.519848 \\
-0.006639 & -0.072710 & 0.727169 & 0.679904 & 0.056715 & -0.020214 \\
-0.051388 & -0.401225 & 0.257334 & -0.256817 & -0.813892 & -0.204406 \\
-0.021490 & -0.247111 & 0.527305 & -0.634407 & 0.504591 & -0.057784 \\
-0.083511 & 0.547845 & 0.092443 & -0.064951 & -0.007965 & -0.824660
\end{bmatrix}
$$

Since $\hat{P}$ is not positive definite, we cannot proceed to the next stage of
Winters' method to evaluate the matrix $T$.

The above example is not exceptional in producing a matrix $\hat{P}$ that
does not lead to determination of latent classes. Various choices of sets
of keywords have been found generally to lead either to a matrix $\hat{P}$ that
is not positive definite, or to determination of "probabilities" that do
not all lie within the range 0 to 1.
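
The failure of positive definiteness may be checked independently with a few lines of code. The sketch below is not the Householder procedure used above: it substitutes a Cholesky factorization, which can be completed only for a positive definite matrix. The names `is_positive_definite` and `P_HAT` are illustrative, and the matrix is transcribed from the second array printed above (the one with leading element 1.0), which is assumed here to be the matrix whose eigenvalues are quoted.

```python
import math


def is_positive_definite(a):
    """Return True if the symmetric matrix `a` has a real Cholesky factor."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                pivot = a[i][i] - s
                if pivot <= 0.0:
                    # A non-positive pivot means the matrix is not positive definite.
                    return False
                low[i][i] = math.sqrt(pivot)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return True


# Transcribed from the text; treated here as the matrix to be factorized.
P_HAT = [
    [1.0, 0.031402, 0.006709, 0.052098, 0.021696, 0.084499],
    [0.031402, 0.001285, 0.0, 0.0, 0.0, 0.011704],
    [0.006709, 0.0, 0.0, 0.000714, 0.000285, 0.0],
    [0.052098, 0.0, 0.000714, 0.001570, 0.002997, 0.000143],
    [0.021696, 0.0, 0.000285, 0.002997, 0.000143, 0.0],
    [0.084499, 0.011704, 0.0, 0.000143, 0.0, 0.000571],
]

print(is_positive_definite(P_HAT))
```

The factorization breaks down at the third pivot, because the third diagonal element of the matrix is zero while the third row is not.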

However, even if a set of disjoint latent classes does not exist,
there arises the question as to whether there exist classes that are
almost disjoint, and for which the accounting equations may be
approximately true. This is investigated in the next section.

3.2 An Attempt to Use Latent Class Analysis.

3.2.1 General.

Instead of attempting a matrix solution for latent class analysis,
the present section presents a different method to determine latent
classes and their associated probabilities.

In this new method determination of the number of latent classes
is still a difficult problem. As in the method of Winters, we assume
that the number of latent classes is equal to the number of keywords.
There is justification for this assumption: in the special instance
that no keywords tend to associate, the number of latent classes is
certainly equal to the number of keywords. Also, if the number of
existing latent classes is, in fact, less than the number of keywords,
then the probabilities corresponding to the non-existing latent
classes will be computed as zeros, and the assumption will still be
valid. The new numerical method will be called the "minimizing
method".

3.2.2  Numerical  Solution. 

The original statement of the problem of latent class analysis
involves solution of the set of equations defined in (2.4), (2.5),
(2.6), (2.7), and so forth.

For practical application, however, it is reasonable to make the
following assumptions:

1. Significant associations of keywords within documents never involve
sets of more than three keywords. This means that only the $p_i$'s,
$p_{ij}$'s, and $p_{ijk}$'s need be considered, but not the $p_{ijkl}$'s, etc.

2. If the $p_{ij}$'s or $p_{ijk}$'s are sufficiently small, then the equations (2.6)
and (2.7) may be neglected.

In the case of N keywords and N latent classes, let us define a
function $F(G,H)$ with $N(N+1)$ variables, $G = (g^{\ell})$ and $H = (h_i^{\ell})$, as


follows: 


(3.1) 


Obviously $F(G,H)$ is a non-negative function, and it attains its minimum
value of zero if, and only if, each term of the function is equal to zero.
This minimum value occurs when the g's and h's correspond to the latent
class structure that satisfies the equations (2.4), (2.5), (2.6), and (2.7).

The function $F(G,H)$ has a concave-upward (convex) form at the
solution points, because the second partial derivatives of $F(G,H)$ with
respect to the g's and h's are always positive.

For the purpose of the computations, new variables $x_i^{\ell}$, $y_{ij}$, and
$y_{ijk}$ are defined as follows:

$$x_i^{\ell} = h_i^{\ell} / p_i \tag{3.2}$$

$$y_{ij} = p_{ij} / (p_i p_j) \tag{3.3}$$

$$y_{ijk} = p_{ijk} / (p_i p_j p_k) \tag{3.4}$$


The function $F(G,H)$ may then be rewritten as the function

$$
\hat{F}(G,X) = \Bigl(\sum_{\ell=1}^{N} g^{\ell} - 1\Bigr)^2
+ \sum_{i=1}^{N} \Bigl(\sum_{\ell=1}^{N} g^{\ell} x_i^{\ell} - 1\Bigr)^2
+ \sum_{i=1}^{N} \sum_{j=1}^{N} \Bigl(\sum_{\ell=1}^{N} g^{\ell} x_i^{\ell} x_j^{\ell} - y_{ij}\Bigr)^2
+ \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{k=1}^{N} \Bigl(\sum_{\ell=1}^{N} g^{\ell} x_i^{\ell} x_j^{\ell} x_k^{\ell} - y_{ijk}\Bigr)^2
\tag{3.5}
$$


The first two sets of summations will be zero if $g^{\ell}$ and $x_i^{\ell}$ are chosen
so that

$$\sum_{\ell=1}^{N} g^{\ell} = 1 \tag{3.6}$$

$$\sum_{\ell=1}^{N} g^{\ell} x_i^{\ell} = 1 \tag{3.7}$$


where  the  subscript  i  varies  from  1  to  N. 

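If the conditions (3.6) and (3.7) are satisfied, then the function
$\hat{F}(G,X)$ to be minimized may be reduced to

$$
\hat{F}(G,X) = \sum_{i=1}^{N} \sum_{j=1}^{N} \Bigl(\sum_{\ell=1}^{N} g^{\ell} x_i^{\ell} x_j^{\ell} - y_{ij}\Bigr)^2
+ \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{k=1}^{N} \Bigl(\sum_{\ell=1}^{N} g^{\ell} x_i^{\ell} x_j^{\ell} x_k^{\ell} - y_{ijk}\Bigr)^2
\tag{3.8}
$$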


Among the possible numerical techniques to solve the minimization
problem, the method of steepest descent (21) is suggested. It may be
described as follows. In the neighborhood of the solution, the function
$\hat{F}(G,X)$, say $\hat{F}(Z)$, has a concave-upward surface. If an initial value $Z_0$ is
chosen close to the solution of $\hat{F}(Z) = 0$, then a better approximation
$Z_1$ is sought in the form

$$Z_1 = Z_0 + \alpha_0 d_0 \tag{3.9}$$

where $\alpha_0$ forms a vector of step sizes, and $d_0$ indicates the direction of
steepest descent. In general $d_0$ is defined as $-\operatorname{grad} \hat{F}(Z_0)$, so that (3.9)
becomes

$$Z_1 = Z_0 - \alpha_0 \operatorname{grad} \hat{F}(Z_0) \tag{3.10}$$

In the computation, all elements of the step-size vector are assumed to be
constant, and coordinate axes are used in place of $\operatorname{grad} \hat{F}$.


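For the first approximation to the solution of $\hat{F}(G,X) = 0$, the
following values can be chosen:

$$g^{\ell} = 1/N \qquad (\ell = 1, \ldots, N) \tag{3.11}$$

$$x_i^{\ell} = 1 \qquad (\ell = 1, \ldots, N \text{ and } i = 1, \ldots, N) \tag{3.12}$$

which satisfy the relations in (3.6) and (3.7).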

In the iteration procedure, the next approximations are calculated
by adding a small perturbation a or -a to the previous approximations so
that $\hat{F}(G,X)$ is decreased. The iteration is repeated until the
value of $\hat{F}(G,X)$ becomes sufficiently small or until a given number of
iterations is completed.
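
The iteration just described may be sketched in code under several stated assumptions: that the function to be minimized is the sum of squared residuals of the four scaled accounting equations, that the perturbation a is held fixed, and that the search also stops when no single perturbation reduces the function. The names `f_hat`, `moments_from_solution`, and `minimize` are hypothetical and are not taken from the original program.

```python
from itertools import product


def f_hat(g, x, y2, y3):
    """Sum of squared residuals; x[l][i] is the scaled value for keyword i in class l."""
    n_cls, n_kw = len(g), len(x[0])
    total = (sum(g) - 1.0) ** 2
    for i in range(n_kw):
        total += (sum(g[l] * x[l][i] for l in range(n_cls)) - 1.0) ** 2
    for i, j in product(range(n_kw), repeat=2):
        m = sum(g[l] * x[l][i] * x[l][j] for l in range(n_cls))
        total += (m - y2[i][j]) ** 2
    for i, j, k in product(range(n_kw), repeat=3):
        m = sum(g[l] * x[l][i] * x[l][j] * x[l][k] for l in range(n_cls))
        total += (m - y3[i][j][k]) ** 2
    return total


def moments_from_solution(g, x):
    """Compute y_ij and y_ijk 'backwards' from a postulated solution."""
    n_cls, r = len(g), range(len(x[0]))
    y2 = [[sum(g[l] * x[l][i] * x[l][j] for l in range(n_cls)) for j in r]
          for i in r]
    y3 = [[[sum(g[l] * x[l][i] * x[l][j] * x[l][k] for l in range(n_cls))
            for k in r] for j in r] for i in r]
    return y2, y3


def minimize(g, x, y2, y3, step=0.01, max_sweeps=500, tol=1e-6):
    """Perturb one variable at a time by +step or -step, keeping any change
    that lowers f_hat. The inputs are copied, not modified."""
    g = list(g)
    x = [list(row) for row in x]
    # Each slot is (list, index) so that g's and x's are handled uniformly.
    slots = [(g, l) for l in range(len(g))]
    slots += [(row, i) for row in x for i in range(len(row))]
    best = f_hat(g, x, y2, y3)
    for _ in range(max_sweeps):
        improved = False
        for vec, idx in slots:
            old = vec[idx]
            for delta in (step, -step):
                vec[idx] = old + delta
                trial = f_hat(g, x, y2, y3)
                if trial < best:
                    best, improved = trial, True
                    break
                vec[idx] = old  # reject the move and restore the old value
        if best < tol or not improved:
            break
    return g, x, best
```

With a fixed perturbation the variables can never be located more finely than the step itself, which is one reason a constant step tends to stall or oscillate near the minimum.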

It  should  be  noted  that  the  method  described  may  lead  only  to 
approximate  solutions  of  the  latent  class  equations  since,  in  fact,  it 
may  happen  that  no  real  solutions  exist.  However,  approximate  solutions 
may  be  quite  satisfactory  in  practice  since  it  is  not  a  serious  problem 
if  there  is  slight  overlap  between  the  different  classes  of  documents. 

3.2.3  Application  of  Proposed  Method. 

Using the numerical techniques stated in the previous section, a
sample computation was performed. Given a sample solution for 6 keywords
and 6 classes, the probabilities were computed backwards. The sample
solution and the modified probabilities are given in Table 3.1.

Table 3.1 A Sample Solution of the Accounting Equations

Substituting the initial approximation $g^{\ell} = 1/6$ and $x_i^{\ell} = 1$, the
function $\hat{F}(G,X)$ was found to have a value of 6.259737. After 2550
iterations, the value of $\hat{F}(G,X)$ was reduced to 0.005164, and the
resulting values of the g's and x's are as shown in Table 3.2.

The labelling of the latent classes of this approximation is, of
course, not necessarily in the same order as that of Table 3.1.
Comparison of the two tables indicates that $g^1$ of Table 3.2 corresponds
to $g^2$ of Table 3.1, $g^2$ to $g^6$, $g^3$ to $g^1$, $g^4$ to $g^5$, $g^5$ to $g^3$, and $g^6$ to
$g^4$. The correspondence is shown in Table 3.3.

It is apparent from Table 3.3 that the solution is considerably
different from the exact values. In order to increase the accuracy,
more iterations are needed. However, this is not an easy task, because
at a high number of iterations the value of $\hat{F}(G,X)$ is apt to oscillate
and does not converge smoothly. The rate of convergence, and the tendency
to oscillate, depend on the choice of the constant scaling factor
which denotes the step size for the next iteration. Therefore, as the
iterations proceed, the value of the step size must be suitably changed
as necessary in order to improve the rate of convergence.

3.2.4  Discussion. 

The sample calculations of the previous section illustrate that,
even if estimations of step size are made at certain stages, the
iterations must be repeated many times in order to obtain a solution
with acceptable accuracy. This problem may not be very serious in the
sample calculations, which applied to only 6 keywords and 6 latent
classes. However, in any practical instance in which there may be
hundreds of keywords, the method involves so many multiplications to
compute the function $\hat{F}(G,X)$ that the accuracy of the approximated
values is doubtful. Furthermore, the enormous number of iterations
that must be performed makes the method very expensive in computer time.

Table 3.2 Approximations Given by Minimizing Method

Table 3.3 Comparison of Postulated $g^{\ell}$ and $x_i^{\ell}$ with Computed Values (in parentheses)

3.3 Conclusions Regarding the Limitations, or Unsuitability, of
Latent Class Determination.

3.3.1  Winters'  Method. 

Two different methods have been described and applied to the
solution of the system of non-linear equations which define the latent
class structure. First, Winters' numerical technique was applied
directly to determine the latent classes for an existing data base of
acoustic literature. The method was not successful because the
assumption that $\hat{P}$ is positive definite did not hold in this case. In fact
$\hat{P}$ had three positive eigenvalues and three negative eigenvalues. In
order to obtain $T$ it is necessary to compute the square roots of the
eigenvalues of $\hat{P}$, and if any of these eigenvalues are negative, it is
impossible to obtain a real $T$. The definiteness of $P$ was not checked, but
$P$ might not be positive definite either. If, and only if, the
conditions (2.15) and (2.16) hold, then Winters' technique can be
applied. However, an arbitrary data base does not in general satisfy
these conditions, and so a latent class structure does not generally
exist. This implies that, in general, keywords do not occur
statistically independently in each class, and so $h_{ij}^{\ell} \neq h_i^{\ell} h_j^{\ell}$.
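
The obstruction caused by negative eigenvalues can be written out explicitly. The following is a sketch only: it assumes that $T$ is built from the eigen-decomposition as $T = QD^{-1/2}$ with $Q'Q = I$, a construction consistent with the description above but not quoted from Winters' paper.

```latex
\hat{P} = Q D Q', \qquad T = Q D^{-1/2}, \qquad
D^{-1/2} = \operatorname{diag}\bigl(\lambda_1^{-1/2}, \ldots, \lambda_6^{-1/2}\bigr)

T'\hat{P}T = D^{-1/2}\, Q' (Q D Q')\, Q\, D^{-1/2} = D^{-1/2} D\, D^{-1/2} = I
```

Each $\lambda_i^{-1/2}$ is real only when $\lambda_i > 0$. For the smallest eigenvalue reported in Section 3.1, $-0.013384$, the square root is imaginary, so no real $T$ of this form exists.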

An attempt to use Winters' method while avoiding the above
difficulties might proceed as follows:

1. Select a few keywords arbitrarily.

2. Compute $P$ and $\hat{P}$.

3. If both $P$ and $\hat{P}$ are positive definite, then add more keywords
and go to 2. Otherwise, reject some keywords and add others,
and go to 2.

The above steps might be continued until a realisable latent class
structure results. Of course, even if such a procedure is applicable
to a practical data base that involves hundreds of keywords, it may be
very time consuming and expensive. Our investigations have provided no
evidence to suggest the practical feasibility of determining the latent
class structure of a data base that contains several hundred keywords.

There is also an inconsistency between the definitions and the
numerical approach by Winters. In the definitions of latent class
structure in (2.5), (2.6), and (2.7), the subscripts i, j, and k
are defined to have unequal values, and so the $p_{ii}$'s and $p_{iik}$'s are
undefined. However, these undefined terms do appear as diagonal elements
of the matrices $P$ and $\hat{P}$, and hence direct application of Winters'
analysis is not possible. Winters did not mention this fact in his
paper (24). This is, however, a minor problem, and may be overcome by
changing the definition such that the subscripts i, j, and k may be the
same. The probability $p_{ii}$ may be defined as the probability that a
document contains at least two occurrences of the $i$th keyword, and the
probability $p_{iik}$ as the probability that a document contains at least
two occurrences of the $i$th keyword and one of the $k$th keyword. This
change of definition does not destroy the latent class structure,
because it is not unreasonable to apply the independence assumption to
the $h_{ii}^{\ell}$'s and $h_{iik}^{\ell}$'s such that $h_{ii}^{\ell} = h_i^{\ell} h_i^{\ell}$ and
$h_{iik}^{\ell} = h_i^{\ell} h_i^{\ell} h_k^{\ell}$.
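
The modified definition can be illustrated by a small estimation routine. This is a sketch under assumptions: documents are represented as lists of (truncated) title words, probabilities are estimated as simple relative frequencies, and only the first-order and second-order quantities are shown. The function name `estimate_probabilities` and the sample titles are invented for the illustration.

```python
from collections import Counter
from itertools import combinations


def estimate_probabilities(documents, keywords):
    """Estimate p_i, p_ij (i != j) and p_ii as relative frequencies.

    p_ii counts documents in which keyword i occurs at least twice.
    """
    n_docs = len(documents)
    single = {k: 0 for k in keywords}
    pair = {}
    for words in documents:
        counts = Counter(w for w in words if w in single)
        for k, c in counts.items():
            single[k] += 1
            if c >= 2:
                pair[(k, k)] = pair.get((k, k), 0) + 1
        # Unordered pairs of distinct keywords present in the document.
        for a, b in combinations(sorted(counts), 2):
            pair[(a, b)] = pair.get((a, b), 0) + 1
    p1 = {k: v / n_docs for k, v in single.items()}
    p2 = {key: v / n_docs for key, v in pair.items()}
    return p1, p2


docs = [
    ["NOISE", "NOISE", "WATER"],
    ["NOISE"],
    ["EAR", "WATER"],
    ["ULTRA"],
]
p1, p2 = estimate_probabilities(docs, ["NOISE", "WATER", "EAR"])
```

Pairs that never occur are simply absent from the second dictionary and correspond to the zero entries of the matrices.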

According to the formula (2.9) which evaluates the ordering ratio,
every document is classifiable, even ones with no keywords. This appears
to be at variance with the intuitive idea that the absence of keywords
implies no information, and hence such documents cannot be classified.

We suggest one alternative to the above: change the formula (2.9),
which calculates the ordering ratio, to the form

$$
\frac{g^{\ell} h_1^{\ell} h_2^{\ell} \cdots h_M^{\ell}}
{\sum_{k=1}^{N} g^{k} h_1^{k} h_2^{k} \cdots h_M^{k}}
\tag{3.13}
$$


which neglects the keywords that do not appear in the document. This
formula is more easily computed than the one in (2.9), since formula
(3.13) requires only M(N+1) multiplications, where M is the number of
keywords in a given document. In contrast, the formula (2.9) requires
N(N+1) multiplications for every case.
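
A ratio of this kind, restricted to the keywords that actually occur in a document, might be computed as in the sketch below. The layout of the inputs, the name `ordering_ratios`, and the decision to return `None` for a document with no keywords are assumptions made for the illustration; they are not taken from formula (2.9) or from Winters' paper.

```python
def ordering_ratios(g, h, present):
    """Ratio for each latent class, using only the keywords in the document.

    g[l]       probability of latent class l
    h[l][i]    probability of keyword i within class l
    present    indices of the keywords found in the document
    """
    if not present:
        # Assumption: a document without keywords is left unclassified.
        return None
    weights = []
    for g_l, h_l in zip(g, h):
        w = g_l
        for i in present:
            w *= h_l[i]
        weights.append(w)
    total = sum(weights)
    if total == 0.0:
        return [0.0] * len(weights)
    return [w / total for w in weights]


ratios = ordering_ratios([0.5, 0.5], [[0.2, 0.1], [0.6, 0.1]], [0])
```

In this made-up case only the first keyword is present, so the second column of `h` plays no part in the result.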


3.3.2  Minimizing  Method. 

The other approach, proposed in Section 3.2, to solve the latent class
equations is the minimizing method, in which the necessary probabilities
are computed to minimize the non-negative function $\hat{F}(G,X)$. The sample
computations were for 6 keywords and 6 latent classes. Even after 2550
iterations, with about 14 minutes of execution time, the approximation was
not close to the exact solution. Thus, the iterative procedure does
not provide an economic solution to the problem.


3.3.3  Summary. 

A fundamental question concerning latent class analysis is whether
such latent classes exist for an arbitrary group of documents.
Since the required probabilities are estimated with the possibility of
some numerical error because of finite sampling, it is very unsatisfactory
to have latent class determination dependent on methods whose results
are affected by small changes in the numerical data.

In general, latent class analysis involves too many unknowns,
the g's and h's, and it requires solution of a large system of non-linear
equations. In fact, solution of such equations poses a problem in
numerical analysis, and it is not clear that the resulting latent classes
are sufficiently well-defined to be useful in document classification.

CHAPTER IV

ATTRIBUTE ANALYSIS


4.1 Classification by Attribute Number.

On  the  basis  of  Luhn's  pioneer  work  (10,  11),  in  1961  M.  E.  Maron 
applied  statistical  techniques  to  the  problem  of  automatic  classifi¬ 
cation.  He  derived  a  formula  based  on  probabilities  of  word  occurrences 
and  subject  categories.  He  used  the  computer  to  evaluate  the  probability 
that  a  document  which  contains  a  certain  combination  of  keywords  also 
belongs  to  a  certain  category.  In  addition  to  developing  prediction 
formulas  based  on  probabilities,  he  carried  out  experimental  work  which 
may  be  used  as  the  basis  to  determine  the  direction  of  further  studies. 

Maron's prediction formula for classification is based entirely on
the statistical associations between categories and certain keywords in
documents. Suppose that a document contains only one keyword $K_i$. Then
the probability that this document belongs to the $k$th category is
expressed by

$$P(C_k; K_i) = \frac{P(K_i; C_k)\,P(C_k)}{P(K_i)} \tag{4.1}$$

where $P(K_i; C_k)$ is the probability that a document in the $k$th category
contains the $i$th keyword $K_i$. The term $P(C_k)$ is the probability that
a document is in the category $C_k$, and the term $P(K_i)$ is the probability
that a document contains the keyword $K_i$.

The value of $P(C_k; K_i)$ indicates the degree of association of the
given document with the $k$th category. Therefore if, regarded as a
function of $C_k$, $P(C_k; K_i)$ has its largest value at $k = q$, then
the $q$th category is the most suitable category for the document.


More generally, suppose that a document has M keywords $\{K_i\}_1^M$. The
probability that the document belongs to the category $C_k$ is then

$$P(C_k; \{K_i\}_1^M) = \frac{P(\{K_i\}_1^M; C_k)\,P(C_k)}{P(\{K_i\}_1^M)} \tag{4.2}$$

Since $P(\{K_i\}_1^M)$ is independent of the choice of the categories, the above
expression may also be written in the form

$$P(C_k; \{K_i\}_1^M) = \kappa\,P(\{K_i\}_1^M; C_k)\,P(C_k) \tag{4.3}$$

where $\kappa$ is a constant which is independent of the choice of categories.

In order to simplify the computations Maron made the important
assumption that in each category the keywords occur in a statistically
independent manner. Then (4.3) may be further simplified to become

$$P(C_k; \{K_i\}_1^M) = \kappa\,P(C_k) \prod_{i=1}^{M} P(K_i; C_k) \tag{4.4}$$

and $P(C_k; \{K_i\}_1^M)$ is then called an "attribute number". The symbol $\prod$
signifies the multiplication of terms as i ranges from 1 to M.
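
The computation of attribute numbers under the independence assumption may be sketched as follows. The constant factor is omitted because it does not affect the ranking of categories. The dictionary layout, the function names `attribute_numbers` and `best_category`, and the probabilities in the example are all invented for the illustration.

```python
def attribute_numbers(prior, likelihood, keywords):
    """Return P(C_k) times the product of P(K_i; C_k) for each category.

    prior[c]           probability of category c
    likelihood[c][w]   probability that a document of category c contains w
    keywords           the keywords found in the document
    """
    scores = {}
    for category, p_c in prior.items():
        score = p_c
        for word in keywords:
            # A keyword never seen in the category contributes zero.
            score *= likelihood[category].get(word, 0.0)
        scores[category] = score
    return scores


def best_category(scores):
    """Category with the largest attribute number."""
    return max(scores, key=scores.get)


prior = {"speech": 0.5, "noise": 0.5}
likelihood = {
    "speech": {"SPEEC": 0.4, "NOISE": 0.1},
    "noise": {"SPEEC": 0.05, "NOISE": 0.5},
}
scores = attribute_numbers(prior, likelihood, ["SPEEC", "NOISE"])
choice = best_category(scores)
```

Because every factor enters as a product, a single zero probability forces the whole attribute number for that category to zero.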

4.2  Selection  of  Data. 

In his research Maron used an experimental data base of some 405
abstracts selected from the computer journal literature. He attempted to
classify the documents by automatic processing of the abstracts. These
abstracts are in the IRE Transactions on Electronic Computers, vol. EC-8,
no. 1, published by the IRE Professional Group on Electronic Computers
in 1959.

The 405 abstracts were divided into two groups. Group 1 consisted
of 260 abstracts which were published in the March and June issues of
1959. Group 2 consisted of 145 abstracts which were available in the
September issues of 1959. Group 1 formed the data base for the
computation of the statistical values required for the automatic
classification. Group 2 was used to test the theory of the category
prediction based on use of the statistical data collected from Group 1.
Therefore Group 2 was not considered until all the statistical procedures
had been performed on Group 1. The 260 abstracts in Group 1 contained more
than 20,000 words, of which 3,263 were different, and the average number
of words in an abstract was 79.

4.3  Selection  of  Categories. 

The  IRE  had  its  own  categories  for  the  classification  of  computer 
literature.  They  consisted  of  10  categories  and  about  15  subcategories. 
However,  Maron  considered  that  the  IRE  categories  were  not  distinct 
enough  to  be  used  as  a  test  of  his  procedures,  and  so  he  grouped  the 
documents  among  32  subject  categories.  The  260  documents  of  Group  1 
were then manually classified into the supposed proper categories. Most
of the documents fell naturally into a single category, but about 20%
of the documents belonged to two categories, and some of them belonged
to three categories.

4.4  Selection  of  Keywords. 

In  his  work  Maron  used  90  keywords  and  formulated  a  theoretical 
analysis  to  relate  documents  and  keywords  as  described  below. 

According to Shannon's theory of entropy, the average uncertainty
H with which a document may be assigned to a category is

$$H = -\sum_{k=1}^{32} P(C_k) \log_2 P(C_k) \tag{4.5}$$

where $P(C_k)$ denotes the probability that a document belongs to the
$k$th category.

Suppose that a document is indexed by one word, say $W_i$. Then the
average uncertainty $H_i$ that the document belongs to any one of the 32
categories can be represented by

$$H_i = -\sum_{k=1}^{32} P(C_k; W_i) \log_2 P(C_k; W_i) \tag{4.6}$$

where $P(C_k; W_i)$ is the probability that a document keyworded by the word
$W_i$ belongs to the $k$th category $C_k$.

Since the difference $H - H_i$ is the uncertainty removed by the
selection of the word $W_i$ as a keyword, the keywords should be decided by
computing $H - H_i$ for all words that appear in the data, and by ranking
the resulting values in decreasing order. Such a list then shows the
words in order of their efficiency as keywords.
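
The ranking by removed uncertainty may be sketched as follows, assuming that each document carries a single category label and that the probabilities are estimated as relative frequencies. The names `entropy` and `rank_words` and the four sample documents are invented for the illustration.

```python
import math
from collections import Counter, defaultdict


def entropy(probs):
    """Shannon entropy in bits; zero probabilities contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def rank_words(documents):
    """Rank words by H - H_i, largest first.

    documents is a list of (category, words) pairs.
    """
    n_docs = len(documents)
    cat_counts = Counter(cat for cat, _ in documents)
    h_all = entropy(c / n_docs for c in cat_counts.values())
    by_word = defaultdict(Counter)
    for cat, words in documents:
        for w in set(words):
            by_word[w][cat] += 1
    gains = {}
    for w, counts in by_word.items():
        total = sum(counts.values())
        gains[w] = h_all - entropy(c / total for c in counts.values())
    return sorted(gains.items(), key=lambda kv: kv[1], reverse=True)


sample = [
    ("A", ["x", "z"]),
    ("A", ["x"]),
    ("B", ["y", "z"]),
    ("B", ["y"]),
]
ranked = rank_words(sample)
```

In the sample, the words `x` and `y` each occur in one category only and remove the whole bit of uncertainty, whereas `z` is spread evenly over both categories and removes none.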

However, Maron did not follow this method to determine the 90 keywords
from Group 1. Instead, he first removed the 55 function words
(e.g. the, of, a) which had a total of 8,402 occurrences. Thus,
about 2% of the different words accounted for over 40% of the total
occurrences. He also removed frequently occurring words (e.g. computer,
data, system), and the 2,120 rarely occurring words which were used only
once or twice and which accounted for 65% of the total of 3,263 different
words. About 1,000 different words remained as possible keywords.
Among them, 90 words were each found to occur predominantly in a single
category, and were considered to be suitable choices for keywords for
automatic classification.

In the present thesis, which describes techniques applied to a
data base formed from the acoustic literature, we do not follow the
method of keyword selection used by Maron. Our method is described in
detail in Chapter V.

4.5  Experimental  Results. 

Maron's experiment was divided into two separate parts. The first
was applied to documents of Group 1. The second was applied to documents
of Group 2. As previously stated, all necessary data (the 90 keywords
and the values of the $P(C_k)$'s and $P(K_i; C_k)$'s) were determined by use of
documents of Group 1. Therefore, the results for Group 2 provided
information for discussion of the generality of Maron's method and for
suggestion of future extension of work in automatic classification.

As may be seen from the prediction formula (4.4), if at least one
of the numbers $P(K_i; C_k)$ is zero, then the attribute number becomes zero.
In order to avoid this disadvantage, Maron assigned a very small value
(viz. 0.001) to replace the zero values of the $P(K_i; C_k)$. This technique
proved very useful in the classification of Group 2. Because the values
of $P(K_i; C_k)$ were computed only from documents of Group 1, some of the
$P(K_i; C_k)$ needed to compute the attribute numbers in Group 2 were not
available but were approximated by the assumed small value.

The results obtained by Maron are summarized in Table 4.1. It may
be noted that for the documents of Group 1, the attribute analysis method
worked just as well as the manual judgments. Of the 247 documents
available for automatic classification there were 209 for which the
computer correctly assigned the largest attribute number. As described in
Section 4.3, the manual examinations could not place 20% of the documents
into just one category. This suggests that about 20% uncertainty
is likely to be involved in any type of classification of those
documents. Thus, it may be regarded as surprisingly good that 84.6%
of documents in Group 1 were classified under correct categories by
the computer. In his paper Maron did not give complete details regarding
classification of Group 2. Thus we cannot discuss his results
precisely. However, the figures available in Table 4.1 for documents
that contain more than one keyword show that 44 out of 85 documents
were correctly classified. Thus approximately 50% of these documents were
correctly classified under only one category.

4.6  Summary. 

The value of Maron's pioneer work on automatic classification is
not only that he used a statistically based classification system to
successfully derive the proper category for a document, but also that he
introduced the concept of "attribute number" to describe the degree
of association between a given document and category. Maron made the
statement that "... instead of stating that either a document belongs
to a given category or not, it would be more realistic to recognize
that a document can belong to a category to a degree (i.e., with a
weight)." The degrees are, in fact, indicated by the set of attribute
numbers.

In summary, the experimental results of Maron are sufficiently
encouraging for us to proceed to modify, and attempt to improve,
Maron's method of attribute analysis, which forms the foundation of a
statistical approach to relationships between keywords and categories.

It should be noted that the keywords used by Maron were chosen from
the words that appeared in the document abstracts. An automatic choice
of keywords therefore requires that the abstracts be available in
machine-readable form. Computer processing of abstracts is, of course, more
costly than similar processing of document titles, and there arises the
question as to whether an efficient choice of keywords could be based
on processing of title words only. This is one of the questions
considered in the subsequent chapters.

Table 4.1 Summary of Maron's Results

CHAPTER V

ACOUSTICS DATA BASE AND SELECTION OF KEYWORDS

5.1 Selection of Data.

The data base used to perform our experiments consists of
1572 titles of papers published in the Journal of the Acoustical Society
of America (JASA) in 1966, 1967, 1968, and 1961.

An  individual  datum  consisting  of  journal  name,  year,  volume  number, 
page,  authors,  and  title  is  punched  on  cards  to  provide  the  data  base 
accessible  to  the  computer.  The  author  and  title  words  are  truncated  to 
five  letters.  A  detailed  description  of  this  acoustics  data  is  given  in 
JASA  vol.  43,  no.  6  (7).  Although  truncation  might  be  undesirable  in  a 
data  base  used  for  information  retrieval,  it  has  no  effect  on  the  valid¬ 
ity  of  the  results  of  the  present  thesis. 
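
The truncation to five letters can be illustrated with a short routine. This is a sketch only: the conversion to upper case and the stripping of punctuation are assumptions suggested by the capitalised keyword forms used in this thesis, and the names `truncate` and `index_title` are hypothetical.

```python
def truncate(word, length=5):
    """Upper-case a word and keep at most its first `length` letters."""
    return word.upper()[:length]


def index_title(title, length=5):
    """Split a title into words and truncate each one."""
    stripped = (w.strip(".,;:()\"'") for w in title.split())
    return [truncate(w, length) for w in stripped if w]


words = index_title("Absorption of ultrasonic noise in water.")
```

Words that differ only after the fifth letter, such as "absorb" and "absorption", are mapped to the same form, which is the effect described for the truncated keywords.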

The  1572  titles  are  divided  into  four  groups.  Group  1  consists  of 
395  titles  which  were  all  published  in  1966.  Group  2  consists  of  385 
titles  which  were  published  in  1967.  Group  3  consists  of  506  titles 
which  were  published  in  1968.  Group  4  consists  of  286  titles  which 
were published in 1961. Group 1 will be used as the base data for the
choice of 200 keywords and to estimate the probabilities, and other
necessary values, required for the experimental model. The automatic
classification schemes will be tested over Group 1, Group 2, Group 3, and
Group 4 separately, and the results will be compared with those of
previous researchers.

5.2  Selection  of  Categories. 

JASA has prepared 16 main subject categories to classify the articles
that it publishes. Each of the 16 categories has been further divided
into several sub-categories.

In our experimental investigation the JASA sub-categories are not
used because they are too fine-grained to distinguish the concepts of
articles. Furthermore, of the 16 main categories, the categories 1, 3,
and 8 are not used because very few articles have been issued in these
subject categories. Thus, 13 main categories out of 16 are used in our
experiments. An additional category is provided to classify articles which
cannot be classified under any of the 13 categories. The resulting 14 main
categories are renumbered and are listed below:

1.  Architectural  Acoustics. 

2.  Physiological  and  Psychological  Acoustics. 

3.  Acoustical  Instruments  and  Apparatus. 

4.  Music  and  Musical  Instruments. 

5.  Noise  and  Noise  Control. 

6.  Speech  Communication. 

7.  Ultrasonics. 

8.  Radiation  and  Scattering. 

9.  Mechanical  Vibrations  and  Shock. 

10.  Underwater  Sound. 

11.  Aeroacoustics,  Macrosonics. 

12.  Acoustic  Signal  Processing. 

13.  Bioacoustics. 

14.  Miscellaneous. 

The titles used in the experiment are manually classified following
the above classification schedule, accepting the JASA assignment of
category, and the indication of subject category is punched on each data
card that describes the document.
5.3  Selection  of  Keywords. 

M. E. Maron suggested a method to select keywords for the
classification system. The method is based on Shannon's theory of entropy,
and is described in detail in Chapter IV. The direct application of
Maron's suggestion to the keyword selection, however, raises a problem
which may be illustrated as follows.

According to Maron's theory, the uncertainty of the correct
classification of each word of a document has its minimum value of 0
if, and only if, the word occurs only in documents of a particular
subject category. This theory completely neglects the frequency of
occurrence of the word. For example, two different words, $W_i$ and
$W_j$, which occur 10 times and 20 times respectively in documents, may
have the same zero value of the uncertainty, because each occurs only in
documents of a particular category. However, the word $W_i$ will
classify 10 documents correctly, while the word $W_j$ will similarly
classify 20 documents. Therefore the word $W_j$ should be considered a
better indicator of subject category than the word $W_i$.
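The point can be checked numerically. The following is a minimal sketch that assumes the uncertainty is the Shannon entropy of the category distribution of the documents containing a word (the exact expression is the one given in Chapter IV); the function name `uncertainty` and the counts are illustrative only.

```python
import math

def uncertainty(counts):
    """Shannon entropy (in bits) of a word's distribution over categories."""
    total = sum(counts)
    return -sum((n / total) * math.log2(n / total) for n in counts if n > 0)

# Hypothetical counts per category for two words, each of which occurs
# in documents of one category only.
w_i = [10, 0, 0]
w_j = [0, 20, 0]

# Both uncertainties are zero, although w_j accounts for twice as many documents.
h_i, h_j = uncertainty(w_i), uncertainty(w_j)
```

The entropy depends only on the proportions, so the two words cannot be told apart by this measure.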

In the present section the method of keyword selection emphasizes
the degree of accuracy of the total automatic classification system.
For documents that contain the word $W_i$, let $N(C_k, W_i)$ denote the
number of documents that should be classified in category $C_k$. Then
$\sum_{t \neq k} N(C_t, W_i)$ is the number of documents classified
under a category $C_t$ different from $C_k$.

First suppose that only one keyword is used to index each document.
To place every document indexed by word $W_i$ under category $C_k$
produces $N(C_k, W_i)$ correct document classifications but produces
$\sum_{t \neq k} N(C_t, W_i)$ incorrect classifications. The value

$$N(C_k, W_i) - \sum_{t \neq k} N(C_t, W_i)$$

is the difference between the number of correct classifications and the
number of incorrect classifications, and it is therefore a measure of
the appropriateness of the word $W_i$ as a single keyword to describe
the class $C_k$.

In the selection of keywords, the difference
$N(C_k, W_i) - \sum_{t \neq k} N(C_t, W_i)$ should be made as large as
possible. Of course it is not always possible even to make this value
positive. For example, some very common words in the present data, such
as ACOUSTIC and SOUND, are distributed approximately uniformly
throughout all 14 categories, and so their differences may be negative.
Thus, commonly occurring words will tend to be automatically eliminated
from consideration as keywords.

Each selected keyword $W_i$ should make the function

$$F(C_k) = N(C_k, W_i) - \sum_{t \neq k} N(C_t, W_i)$$

have a sharply defined peak at some $C_k$.
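The difference function is simple to compute from a column of category counts. The sketch below is an illustration only; the function names are not from the thesis, and the sample column is the ACOUS column of Table 5.1.

```python
def difference(counts, k):
    """F(C_k): correct minus incorrect classifications when every document
    containing the word is assigned to the category with index k."""
    correct = counts[k]
    incorrect = sum(counts) - correct
    return correct - incorrect

def best_category(counts):
    """Return (k, F(C_k)) for the category index k at which F peaks."""
    k = max(range(len(counts)), key=lambda t: counts[t])
    return k, difference(counts, k)

# ACOUS column of Table 5.1 (categories C1..C14).
acous = [2, 5, 1, 0, 0, 2, 7, 8, 0, 12, 2, 2, 0, 1]
k, f = best_category(acous)  # k is a 0-based index, so k + 1 is the category number
```

Since $F(C_k) = 2N(C_k, W_i) - \sum_t N(C_t, W_i)$, the peak of $F$ always lies at the category with the largest count.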

The  proposed  method  of  selection  of  keywords  may  be  illustrated  by 
reference  to  Table  5.1  which  indicates  the  frequencies  of  words  in 
categories  for  6  keywords  and  14  categories. 

The first column of Table 5.1 indicates that documents containing
the keyword ACOUS occur more frequently in $C_{10}$ than in any other
category. Thus, if all documents that contain ACOUS are to be assigned
to a single category, then the category should be chosen as $C_{10}$. Of
the 42 documents that contain ACOUS, this classification will classify
12 documents correctly and 30 incorrectly. Therefore the number of
correct classifications minus the number of incorrect classifications
is -18. Similarly, the second column of Table 5.1 shows that documents
containing the keyword BINAU are most frequently in category $C_2$.


Table 5.1  Word Frequency Table Used for Keyword Selection

              ACOUS  BINAU   DEEP  SOUND  VIBRA   WIDE
C1                2      0      0      2      0      0
C2                5      9      0      7      1      0
C3                1      0      0      2      1      0
C4                0      0      0      2      1      0
C5                0      0      0      0      0      0
C6                2      0      0      4      0      0
C7                7      0      0      6      2      1
C8                8      0      0      8      2      0
C9                0      0      0      0     28      0
C10              12      0      6     17      0      0
C11               2      0      0      7      1      0
C12               2      0      0      0      0      0
C13               0      0      0      0      1      0
C14               1      0      0      2      0      0

Correctly        12      9      6     17     28      1
Incorrectly      30      0      0     40      9      0
Difference      -18      9      6    -23     19      1

The final row of Table 5.1 suggests that the keywords BINAU, DEEP,
and VIBRA are better category indicators than WIDE. Also, the keywords
ACOUS and SOUND are poor choices of keywords for indication of category.
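The ranking implied by the final row of Table 5.1 can be reproduced as follows. This is a sketch under the assumption that keywords are simply sorted by their peak difference; the names `peak_difference` and `select_keywords` are illustrative and are not taken from the program used in the experiments.

```python
# Columns of Table 5.1: documents per category C1..C14 for each word.
table_5_1 = {
    "ACOUS": [2, 5, 1, 0, 0, 2, 7, 8, 0, 12, 2, 2, 0, 1],
    "BINAU": [0, 9] + [0] * 12,
    "DEEP":  [0] * 9 + [6] + [0] * 4,
    "SOUND": [2, 7, 2, 2, 0, 4, 6, 8, 0, 17, 7, 0, 0, 2],
    "VIBRA": [0, 1, 1, 1, 0, 0, 2, 2, 28, 0, 1, 0, 1, 0],
    "WIDE":  [0] * 6 + [1] + [0] * 7,
}

def peak_difference(counts):
    """Correct minus incorrect classifications at the best category."""
    return 2 * max(counts) - sum(counts)

def select_keywords(table, n):
    """Return the n words with the largest peak differences."""
    ranked = sorted(table, key=lambda w: peak_difference(table[w]), reverse=True)
    return ranked[:n]

differences = {w: peak_difference(c) for w, c in table_5_1.items()}
top3 = select_keywords(table_5_1, 3)
```

With `n` set to 200 and the full word list of the base documents, the same procedure would give a keyword list of the kind used in the experiments.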

The above procedure was used to choose keywords from the 395
acoustic titles from 1966. The titles and categories were input to a
computer program which computed the difference values as in Table 5.1
and then selected the keywords that corresponded to the 200 largest
differences. The resulting keywords are listed in Appendix A. Author
names, as well as title words, were allowed as keywords, so as not to
exclude the possibility that certain author names might be very
indicative of subject matter.

5.4  Statistics  on  Data. 

The  attributes  of  group  1  form  the  basis  for  the  prediction  about 
the  attributes  of  group  2  and  group  3.  The  statistical  nature  of  group 
1  is  described  below. 

The  395  titles  in  group  1  contain  a  total  of  3,231  words,  and  the 
average  number  of  words  per  title  is  about  8.2.  There  are  1,327 
different  words  contained  in  the  titles  and  therefore  each  word  occurs, 
on  the  average,  in  two  or  three  titles. 

The titles in group 1 were pre-classified under 14 subject
categories as summarised in Table 5.2.

In Table 5.2, the large numbers that appear under categories 2, 7,
8, 9, and 10 indicate that in 1966 there were many papers that related
to these particular five subject fields. It is interesting to list the
similar statistics for group 2 in order to see the changes in research
interest. The figures of group 2 are shown in Table 5.3.

Table 5.2  Distribution of Number of Titles in Group 1 (1966) over 14 Categories.

Comparing the figures in Table 5.2 and Table 5.3, it is noticed
that in 1967 subject fields 3 and 7, which are Acoustical Instruments
and Apparatus, and Ultrasonics, were becoming more popular, but subject
fields 5, 9, and 11, which are Noise and Noise Control, Mechanical
Vibrations and Shock, and Aeroacoustics, Macrosonics, were becoming
relatively less popular than in 1966. This fact provides a warning that
in attribute analysis it may be very dangerous to use a partial set of
articles as base data to predict the attributes of the whole data.

There were 82 titles in group 1, 102 titles in group 2, 143 titles
in group 3, and 96 titles in group 4 that did not contain any of the 200
selected keywords; the rest of the titles contained at least one, and
up to six, keywords. Table 5.4 gives the figures regarding the number
of keywords in titles.

From  Table  5.4,  it  follows  that  each  title  in  group  1  contains  an 
average  of  1.8  keywords,  each  title  in  group  2  and  group  3  contains  an 
average  of  1.3  keywords,  and  each  title  in  group  4  contains  an  average 
of 1.2 keywords.

Table 5.4  Number of Keywords in Titles.

Number of Keywords              Group 1  Group 2  Group 3  Group 4
0                                    82      102      143       96
1                                   100      147      184       99
2                                   103       91      117       53
3                                    61       25       42       29
4                                    26       16       13        8
5                                    11        2        5        1
6                                    12        2        2        0

Total of Keyword Occurrences        720      490      633      329
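The totals and averages quoted from Table 5.4 can be recomputed from its rows. The sketch below is illustrative; only the counts are taken from the table, and the names are hypothetical.

```python
# Rows of Table 5.4: number of titles containing 0..6 keywords, per group.
titles_by_keyword_count = {
    "group 1": [82, 100, 103, 61, 26, 11, 12],
    "group 2": [102, 147, 91, 25, 16, 2, 2],
    "group 3": [143, 184, 117, 42, 13, 5, 2],
    "group 4": [96, 99, 53, 29, 8, 1, 0],
}

def summarize(counts):
    """Return (number of titles, keyword occurrences, keywords per title)."""
    titles = sum(counts)
    occurrences = sum(n * c for n, c in enumerate(counts))
    return titles, occurrences, occurrences / titles

summary = {g: summarize(c) for g, c in titles_by_keyword_count.items()}
```

The first two entries of each tuple give the group size and the bottom row of the table; the third gives the average number of keywords per title.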
CHAPTER  VI

APPLICATION  OF  MARON'S  ATTRIBUTE  ANALYSIS  TO  ACOUSTICS  DATA  BASE 

6.1  Experimental  Results  on  Acoustic  Data. 

In Chapter V we described in detail the acoustics data base and the
14 categories available for experimentation. A method of selection of
keywords was also described; this is the method used to choose the 200
keywords referred to in the present and subsequent chapters.

Following Maron's scheme, the 395 titles in the 1966 issues, the
385 titles in the 1967 issues, the 506 titles in the 1968 issues, and
the 286 titles in the 1961 issues are separated to form group 1, group
2, group 3, and group 4, respectively.

As shown in Table 6.1, in group 1 there were 82 out of 395 titles
which did not contain any of the chosen 200 keywords; therefore
automatic classification could not be undertaken for these 82 titles.
At least one keyword appeared in the remaining 313 titles, which were
therefore regarded as suitable for classification by the method of
Maron. For the 100 titles that contained only one keyword, the automatic
classification process predicted the correct categories in 79 instances.
The remaining 213 of the 313 titles contained more than one keyword, and
191 of them were classified correctly. Thus, 270 out of 313 titles were
automatically given correct categories, and so for group 1 the accuracy
was about 86.3%.
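The percentages for group 1 follow directly from the counts given above, as the following short check illustrates (the helper `percent` is a hypothetical name).

```python
def percent(correct, total):
    """Percentage of correct classifications, rounded to one decimal place."""
    return round(100.0 * correct / total, 1)

one_keyword = percent(79, 100)        # titles with exactly one keyword
multi_keyword = percent(191, 213)     # titles with more than one keyword
overall = percent(79 + 191, 313)      # all titles with at least one keyword
```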

In group 1 the titles with only one keyword and the titles with
more than one keyword were classified correctly with the quite high
accuracies of 79% and 89.7% respectively. This fact is not surprising
when it is recalled that all necessary statistical data were computed on
the basis of group 1. In groups 2 to 4, however, automatic
classification gave a poor prediction (54.4%, 53.8%, and 40.4%,
respectively) of correct categories for the titles with only one
keyword. On the other hand, for the titles with more than one keyword
the classification was relatively good, with an accuracy of 70.6% in
group 2, 65.4% in group 3, and 64.8% in group 4. Therefore, for titles
that are indexed by more than one keyword, it appears that a high
degree of accuracy in automatic classification may be achieved.

Table 6.1  Experimental Results of Maron's Method Applied to Acoustics Data.

6.2  Discussion  of  Results.

We have described two experiments which have been performed in order
to analyze Maron's automatic classification procedure. The first,
described in Chapter IV, used the abstracts of the IRE Transactions on
Electronic Computers; the other used titles from the Journal of the
Acoustical Society of America. We cannot expect similar results from
both experiments because of the differences in the type of data (one
comprised abstracts, the other comprised titles), in the methods of
keyword selection, and in the category selection. However, the
availability of two separate sets of results does provide more ground
for evaluation of Maron's theory than would one alone.

Comparison of the group 1 results in Tables 4.1 and 6.1 shows that
the present method of choosing keywords, as described in Chapter V, is
very effective: of the documents that contain only one keyword, 79.0%
are classified correctly. In contrast, Maron's choice of keywords led to
a correct classification of only 48.7% of such documents. It is
interesting to note that the percentages of correct classifications of
indexed documents (% of $N_k$) are 84.6% and 86.3% respectively; hence
one may conclude that for the acoustics data base the document titles
provide a satisfactory source of keywords.

Comparison of results for groups 1 to 4 in Table 6.1 shows that,
while the number of correct classifications is less for groups 2 to 4
than for group 1, the number does not change significantly with
increase in the time interval between groups. This suggests that the
vocabulary of significant title words does not change appreciably from
year to year over an eight-year period.

The 20 to 25% reduction in correct classification of documents not
contained in group 1, whether they contain one or more keywords,
suggests that the classification errors are caused by false initial
classification or unsuitable titling of the base documents.

6.3  Possible  Improvements  in  Procedure. 

Maron suggested four methods by which his prediction procedure
might be improved. The first is to use more documents in group 1 in
order to collect more stable statistical data. The second is to
increase the total number of keywords available for the classification.
The third is to apply more accurate calculation of the statistical
terms; for example, in order to predict $P(C_k; K_i, K_j)$ one might use
$P(C_k) P(K_i; C_k) P(K_j; K_i, C_k)$ instead of
$P(C_k) P(K_i; C_k) P(K_j; C_k)$, which is based on an assumption of
independence of certain probabilities. The fourth is to give more
consideration to the frequency of occurrence of keywords in documents.

The first, the second, and the fourth methods appear likely to be
profitable, because more data and keywords lead to more accurate
classification statistics. However, there are two reasons why we cannot
agree completely with Maron's third suggestion. One is that
implementation requires a large computer memory to store the more
accurate statistics. The other is that, although logically the direct
computation of
$P(C_k; K_i, K_j) = P(C_k) P(K_i; C_k) P(K_j; K_i, C_k)$ instead of the
approximate $P(C_k) P(K_i; C_k) P(K_j; C_k)$ should lead to a better
classification in group 1, it is doubtful whether the same is true for
the other groups, because there exists some bias between the groups.
For the groups considered, the experimental results suggest that no
serious error is caused by the assumption that in any category the
keywords occur statistically independently.
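The independence approximation discussed above can be illustrated with a small sketch. It scores each category by $P(C_k)$ times the product of the $P(K_i; C_k)$ and normalises the scores; whether this matches formula (4.4) in every detail is an assumption, and the probabilities used are hypothetical values, not statistics from the acoustics data.

```python
def attribute_numbers(prior, cond, keywords):
    """Score each category by P(C_k) * prod_i P(K_i; C_k), normalised to sum to 1.

    prior: {category: P(C_k)}
    cond:  {category: {keyword: P(K_i; C_k)}}
    """
    scores = {}
    for c, p in prior.items():
        s = p
        for kw in keywords:
            s *= cond[c].get(kw, 0.0)
        scores[c] = s
    total = sum(scores.values())
    return {c: (s / total if total else 0.0) for c, s in scores.items()}

# Hypothetical two-category statistics (not taken from the thesis).
prior = {"C1": 0.5, "C2": 0.5}
cond = {"C1": {"NOISE": 0.4, "JET": 0.2}, "C2": {"NOISE": 0.1, "JET": 0.1}}
posterior = attribute_numbers(prior, cond, ["NOISE", "JET"])
best = max(posterior, key=posterior.get)
```

Only one conditional probability per keyword and category has to be stored, which is the memory advantage over the exact term $P(K_j; K_i, C_k)$.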

In  summary,  attribute  analysis  for  automatic  classification  seems 
to  work  fairly  well.  We  believe,  moreover,  that  the  method  is  very 
satisfactory  for  documents  with  more  than  one  keyword.  It  is  less 
satisfactory  for  documents  with  only  one  keyword.  Therefore,  there  is 
a  need  to  derive  a  method  suitable,  not  only  for  documents  with  several 
keywords, but also for ones with only one keyword. This is discussed
in the next chapter.

CHAPTER  VII

MODIFIED  ATTRIBUTE  ANALYSIS 

7.1  Maximization  of  Correct  Document  Classifications.

7.1.1  Classification  System. 

The present section describes a classification system which
attempts to maximize correct document classifications. The basic theory
is similar to that for the keyword selection as described in Chapter V.

Suppose that a document is indexed by a set of M keywords, denoted
by $\{K_i\}_1^M$. Of all documents indexed by $\{K_i\}_1^M$, let
$N(C_k, \{K_i\}_1^M)$ be the number in category $C_k$. Obviously, for a
document indexed by $\{K_i\}_1^M$, the category in which
$N(C_k, \{K_i\}_1^M)$ has the largest value is the best one in which to
classify the document.

However, if all possible values of $N(C_k, \{K_i\}_1^M)$ are to be
stored for reference, then a very large table is required. With 200
keywords the possible number of combinations of double keywords is
20,100 and there are 1,353,400 combinations of triple keywords, etc.
Even though the 1966 acoustic titles do not contain all these possible
combinations of double or triple keywords, the required tables are
still large, and the execution time for table look-up is
correspondingly large. There is another problem in that when a request
has a new combination of keywords not contained in the tables, no
category can be assigned to it. In order to solve these difficulties,
Maron assumed that in each category keywords occur statistically
independently. He then computed $P(C_k; \{K_i\}_1^M)$ as shown in
formula (4.4) of Chapter IV.
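The table sizes quoted above can be checked with a few lines of Python. The figures 20,100 and 1,353,400 coincide with the number of unordered selections from 200 keywords when a keyword is allowed to repeat; counting only distinct keywords gives slightly smaller values. Which counting convention was intended is an assumption made here for illustration.

```python
import math

n = 200  # number of selected keywords

# Unordered selections of distinct keywords.
distinct_pairs = math.comb(n, 2)
distinct_triples = math.comb(n, 3)

# Unordered selections in which a keyword may be repeated.
pairs_with_repetition = math.comb(n + 1, 2)
triples_with_repetition = math.comb(n + 2, 3)
```

Under either convention the number of possible triples exceeds one million, which supports the point that exhaustive tables are impractical.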

In the present treatment it is supposed that any document can be
properly indexed by only one or two keywords. Two tables are therefore
stored in computer memory. One is for the classification of a document
indexed by a single keyword, and is called a "single keyword table".
The other is for the classification of a document indexed by double
keywords, and is called a "double keyword table". In the 1966 acoustic
titles, the 200 single keywords produce a total of 560 combinations of
double keywords.

The single keyword table contains elements as shown in Table 7.1
for the special case of two keywords and three categories. The columns
are formed from the columns of Table 5.1 that correspond to the selected
keywords. The categories are those that correspond to the peak values
in columns of Table 5.1. Thus the element in the i-th row and j-th
column indicates the number of documents that lie in the i-th category
and contain the j-th keyword. The final row of the table lists the
"response category" $C_k$ which contains the most documents associated
with the corresponding keyword $K_i$. If all documents that contain
$K_i$ are automatically assigned to the category $C_k$, then the
difference between the number of correct and incorrect assignments is
as shown in the "difference" row of Table 7.1.

For example, in Table 7.1, the largest element in the $K_1$ column
is 10. Thus the response category is $C_1$, and the difference is
10 - 4 - 3 = 3.
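The two derived rows of the single keyword table can be obtained from the count columns as in the following sketch. The function name and the dictionary layout are illustrative assumptions; the counts are those of Table 7.1.

```python
def single_keyword_entry(counts):
    """Return (response category index, difference) for one keyword column."""
    k = max(range(len(counts)), key=lambda t: counts[t])
    return k, 2 * counts[k] - sum(counts)

# Columns of Table 7.1: documents in C1, C2, C3 for keywords K1 and K2.
columns = {"K1": [10, 4, 3], "K2": [1, 5, 0]}
single_table = {kw: single_keyword_entry(c) for kw, c in columns.items()}
```

Category indices are 0-based here, so index 0 stands for $C_1$ and index 1 for $C_2$.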

The double keyword table is similar, but the columns correspond to
keyword pairs instead of to single keywords. Thus each column of the
double keyword table lists the number of documents which contain a given
keyword pair, and which are in categories $C_1, C_2, \ldots, C_{14}$.
The last but one element of each column of the double keyword table
indicates the difference between the number of correct classifications
and the number of incorrect classifications that would result if
documents were assigned to the category for which the column element is
a maximum. The last element of each column indicates that particular
category $C_k$.

The following steps indicate an algorithm that may be used to
classify a document into one of the categories $C_k$:

1.  Examine the document for the presence of one, or more, of the
200 keywords.

2.  If the document has only one keyword, then go to step 3. If it
has exactly two keywords, then go to step 4; otherwise go to step 6.
(If no keyword appears in the document, our classification
procedure is not applicable.)

3.  Look up the single keyword table, and determine the response
category.  END.

4.  Look up the double keyword table. If the keyword pair is in the
table, then classify the document under the corresponding response
category.  END.  Otherwise go to step 5.

5.  Referring to the difference corresponding to each keyword in the
single keyword table, determine which keyword gives the maximum
difference. Classify the given document under the corresponding
response category.  END.

6.  Form possible pairs of keywords.

7.  If no pair is in the double keyword table, then go to step 5.
Otherwise go to step 8.

8.  Look up the double keyword table and determine the word pair which
has the maximum difference among the possible pairs in the document.
Classify the document under the corresponding category.  END.
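Steps 1 to 8 can be sketched in Python as follows. This is not the program used in the experiments; the table layout (dictionaries mapping a keyword, or a pair of keywords, to a response category and a difference) and the sample entries are assumptions made for illustration.

```python
from itertools import combinations

def classify(keywords, single_table, double_table):
    """Return a response category, or None if the document has no keyword.

    single_table: {keyword: (category, difference)}
    double_table: {frozenset({kw_a, kw_b}): (category, difference)}
    """
    found = sorted({kw for kw in keywords if kw in single_table})  # step 1
    if not found:
        return None                      # procedure not applicable
    if len(found) == 1:                  # steps 2 and 3
        return single_table[found[0]][0]
    # Steps 4 and 6 to 8: among the keyword pairs present in the double
    # keyword table, take the pair with the maximum difference.
    pairs = [frozenset(p) for p in combinations(found, 2)]
    known = [p for p in pairs if p in double_table]
    if known:
        best = max(known, key=lambda p: double_table[p][1])
        return double_table[best][0]
    # Step 5: fall back to the single keyword with the maximum difference.
    best = max(found, key=lambda kw: single_table[kw][1])
    return single_table[best][0]

# Hypothetical tables for three keywords.
single_table = {"K1": ("C1", 3), "K2": ("C2", 4), "K3": ("C3", 2)}
double_table = {frozenset({"K1", "K3"}): ("C3", 5)}
```

For instance, `classify(["K1", "K2"], single_table, double_table)` falls through to step 5 because the pair is not in the double keyword table, while `classify(["K1", "K2", "K3"], single_table, double_table)` is decided by the pair K1, K3.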


Table 7.1  An Example of Single Keyword Table
(for the case of only two keywords and three categories)

                      K1    K2
C1                    10     1
C2                     4     5
C3                     3     0
Difference             3     4
Response category     C1    C2

7.1.2  Experimental  Results.

The results of the test are shown in Table 7.2.

In Table 7.2 the titles classified by the single keyword table
comprise all titles which contain a single keyword, together with those
titles which contain more than one keyword but of which no pair appears
in the double keyword table. Examination of part A shows that the
accuracy of classification by use of the single keyword table was not
satisfactory, since it is only 68% for group 1 and 54% for group 2. On
the other hand, examination of part B shows that when the double
keyword table may be used, the percentages of correct classification
are 97.7% for group 1, which is almost perfect, and 89.7% for group 2.
This suggests that for this experimental data base it is not necessary
to generate a triple keyword table. Part C of Table 7.2 shows the
results of the total classification system, which are comparable with
the results of Maron's method shown in Table 6.1. For group 1 this
method, which classified 88.2% of classifiable titles correctly, was
slightly superior to Maron's, which gave 86.3% of correct
classifications. For group 2, however, the two methods were equally
satisfactory.

The most striking difference between Tables 7.2 and 6.1 is in the
percentage of correct classifications when the double keyword table may
be used. The percentages 97.7, 89.7, 78.8, and 77.8% obtained by use
of the double keyword table are significantly higher than the 89.7,
70.6, 65.4, and 64.8% listed in Table 6.1 for documents that contain
two, or more, keywords.

Table 7.2  Experimental Results of Maximization Method.

                                                  Group 1  Group 2  Group 3  Group 4
                                                   (1966)   (1967)   (1968)   (1961)
Total number of titles, Nt                            395      385      506      286
Number of titles with no keyword                       82      102      143       96
Number of titles with at least one keyword, Nk        313      283      363      190

A. Titles classified by the single keyword table
   Number of titles, Nk1                              100      215      264      136
   Number of correct classifications                   68      116      152       67
   % of Nk1                                         68.0%    54.0%    57.6%    49.3%

B. Titles classified by the double keyword table
   Number of titles, Nk2                              213       68       99       54
   Number of correct classifications                  208       61       78       42
   % of Nk2                                         97.7%    89.7%    78.8%    77.8%

C. Titles classified by either of the two tables
   Number of titles, Nk1+Nk2                          313      283      363      190
   Number of correct classifications                  276      177      230      109
   % of Nk1+Nk2 (Nk)                                88.2%    62.5%    63.4%    57.4%
   % of Nt                                          69.9%    46.0%    45.5%    38.1%

To examine the results in more detail we may tabulate them as in
Table 7.3 to include the case in which the above steps 1 to 8 rank the
response categories in such a manner that the correct subject category
has second rank in the list. In group 1 there are 305 out of 313 titles
classified correctly in one of the first two ranks. This is a
proportion of 97.4% of the group 1 titles with at least one keyword.
The program following Maron's method printed out 287 as the number of
titles having correct categories in one of the first two positions; the
proportion was 91.7%. In group 2, our method listed correct categories
in one of the first two ranks for 213 titles and Maron's method did so
for 210 titles out of 283 classifiable titles; the proportions were
75.3% and 74.2% respectively.

It  appears  that  the  method  of  the  present  section  is  consistently 
slightly  better  than  Maron's.  One  is  tempted  to  conclude  that  the 
classification  may  well  be  at  least  as  good  as  would  be  obtained  by 
manual  assignment  of  categories. 

It is envisaged that classification of documents into first and
second rank categories might be used in information retrieval as
follows. A searcher who requests a list of documents in a certain
category would first receive a list of those whose highest ranking
response category is that specified. His request could be broadened
by addition of the documents whose second highest ranking response
category is the one specified. Such a broadening would introduce a
number of non-relevant items but, according to Table 7.3, would include
many relevant documents that were not correctly classified in the
highest rank.

The  manner  in  which  various  documents  are  classified  in  the  first 
and  second  rank  is  shown  in  Appendix  D.  The  titles  listed  are  for 
the  documents  of  group  1.  The  keywords  are  underlined. 

One of the most significant features of Table 7.3 is its indication
of how close Maron's method comes to approaching the accuracy of the
method that does not depend on the assumption of statistical
independence. However, it should be remarked that this is also a
verification of our method of choice of keywords. Maron did not use our
method, and with an alternative choice of keywords the results of
Maron's method might be much poorer.

Table 7.3  Comparison of Maximization Method with Maron's Method

                                                    Maximization       Maron's
                                                          Method        Method
Group 1 (1966)
  Number of titles with at least one keyword, Nk             313           313
  Correct classifications in the first rank                  276           270
    % of Nk                                                88.2%         86.3%
  Correct classifications in the second rank                  29            17
  Correct classifications in the first two ranks             305           287
    % of Nk                                                97.4%         91.7%

Group 2 (1967)
  Number of titles with at least one keyword, Nk             283           283
  Correct classifications in the first rank                  177           176
    % of Nk                                                62.5%         62.2%
  Correct classifications in the second rank                  36            34
  Correct classifications in the first two ranks             213           210
    % of Nk                                                75.3%         74.2%

Group 3 (1968)
  Number of titles with at least one keyword, Nk             363           363
  Correct classifications in the first rank                  231           216
    % of Nk                                                63.6%         59.5%
  Correct classifications in the second rank                  44            45
  Correct classifications in the first two ranks             275           261
    % of Nk                                                75.8%         71.9%

Group 4 (1961)
  Number of titles with at least one keyword, Nk             190           190
  Correct classifications in the first rank                  109           108
    % of Nk                                                57.4%         56.8%
  Correct classifications in the second rank                  29            24
  Correct classifications in the first two ranks             138           132
    % of Nk                                                72.6%         69.5%

It should also be borne in mind that Maron's procedure is based
on an approximation to the joint probability of occurrence of all the
keywords in the document. In contrast, our method is based on a
knowledge of the exact frequencies of occurrence of single, and pairs
of, words. It appears, therefore, that exact information about
occurrences of pairs of keywords is somewhat more useful than
approximate predictions of frequencies of higher numbers of occurrences.

7.1.3  Suggestions  and  Discussion. 

The  hypothesis  of  the  method  described  in  the  previous  sections  is 
that  any  document  can  be  indexed  by  use  of  tables  based  on  statistical 
distributions  of  single  or  pairs  of  keywords.  In  accordance  with  this 
hypothesis  the  single  keyword  table  and  double  keyword  table  were 
prepared  for  the  acoustic  titles. 

As was mentioned in section 7.1.1, when a document to be indexed
has a pair of keywords $K_1$ and $K_2$, but the double keyword table
does not include such a pair, then one of $K_1$ and $K_2$ is chosen
according to which corresponds to the larger difference value in the
single keyword table. Thus the existence of one keyword in the document
is completely neglected. One suggestion may be made to avoid this
irrationality.

Instead of choosing one keyword from the pair $K_1$ and $K_2$,
evaluate the number of documents classified by $K_1$ or $K_2$; for
example, in Table 7.1 take the sum of the two column vectors
$K_1 = (10, 4, 3)^T$ and $K_2 = (1, 5, 0)^T$ to get
$K_1 + K_2 = (11, 9, 3)^T$. The category in which the largest number of
documents can be classified correctly by either $K_1$ or $K_2$ may be
chosen as the response category for the request. By this improvement,
all the keywords in a request can participate in determining the
response category without neglecting any of the keywords.
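The suggested combination of single keyword columns can be sketched as below. The function name is hypothetical, and the columns are those of the example in Table 7.1.

```python
def combined_response(columns, keywords):
    """Sum the category-count columns of the given keywords and return
    (summed column, index of the category with the largest sum)."""
    n = len(next(iter(columns.values())))
    total = [sum(columns[kw][t] for kw in keywords) for t in range(n)]
    return total, max(range(n), key=lambda t: total[t])

columns = {"K1": [10, 4, 3], "K2": [1, 5, 0]}  # Table 7.1
total, k = combined_response(columns, ["K1", "K2"])
```

In this example the summed column favours the first category, whereas choosing the single keyword with the larger difference would have selected the response category of $K_2$.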


7.2  Modification  of  Maron's  Method  Using  Keyword  Association.

7.2.1  Classification  System  Based  on  Keyword  Association. 

By knowing the relationships between keywords, the indexing of a
document can be extended by some additional keywords, and the
classification system will then be able to assign the category to the
document more correctly.

A measure of the degree in which keywords are associated within
documents may be formulated as follows. Suppose that $N(K_i)$ and $N(K_j)$
are the frequencies with which documents are indexed by keywords $K_i$
and $K_j$ respectively. Let $N(K_i,K_j)$ be the frequency with which both
$K_i$ and $K_j$ index documents. The probability that a document containing
$K_i$ also contains $K_j$ may be computed as

\[ P(K_j;K_i) = \frac{N(K_i,K_j)}{N(K_i)} \tag{7.1} \]

which gives a measure of the extent to which $K_j$ tends to occur in
documents that contain $K_i$. If no document contains both $K_i$ and $K_j$,
then $P(K_j;K_i) = 0$. If every document containing $K_i$ also contains $K_j$,
then $P(K_j;K_i) = 1$. In general, $P(K_j;K_i) \neq P(K_i;K_j)$, since for
example, the word "retrieval" tends to be used with the word
"information", but "information" is often used without association with
"retrieval".

The proposed application of a keyword association technique
results in a modification of Maron's classification method as described
below. Assume that a request document is indexed by only one keyword,
indicated by $K_i$. The keyword $K_i$ is associated with the keyword $K_r$
($r = 1$ to 200) to an extent measured by $P(K_r;K_i)$. Obviously the degree
of association of $K_i$ with itself is equal to 1, viz. $P(K_i;K_i) = 1$.
The probability $P(C_k;K_i)$ that a document indexed by $K_i$ belongs to
category $C_k$ is modified as follows:


\[ P(C_k;K_i) = \sum_{r=1}^{200} P(C_k;K_r)\,P(K_r;K_i) \tag{7.2} \]


In formula (7.2) each term that appears on the right-hand side denotes
the individual attribute number computed between category $C_k$ and
keyword $K_r$, which relates to the given keyword $K_i$ by a degree
$P(K_r;K_i)$. If a keyword $K_r$ is closely related with $K_i$, then the
attribute number $P(C_k;K_r)$ is considered as an important factor. Thus
the probability $P(K_r;K_i)$ may be regarded as the weight through which
$P(C_k;K_r)$ contributes to the value of $P(C_k;K_i)$.
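
Formula (7.2) may be sketched as follows for a single request keyword.
The dictionary layouts and the function name `modified_probability` are
assumptions made for illustration, and the numerical values are invented.
Keywords with zero association are simply left out of the association
table.

```python
from typing import Dict, List


def modified_probability(p_cat_given_kw: Dict[str, List[float]],
                         assoc: Dict[str, Dict[str, float]],
                         k_i: str) -> List[float]:
    """Apply (7.2): for every category k, sum P(C_k;K_r) * P(K_r;K_i) over r.

    p_cat_given_kw[K_r][k] holds P(C_k;K_r); assoc[K_i][K_r] holds
    P(K_r;K_i) and is expected to contain assoc[K_i][K_i] = 1.
    """
    n_categories = len(p_cat_given_kw[k_i])
    result = [0.0] * n_categories
    for k_r, weight in assoc[k_i].items():
        for k in range(n_categories):
            result[k] += p_cat_given_kw[k_r][k] * weight
    return result


# Invented example with two categories and two keywords.
p_cat_given_kw = {"a": [0.9, 0.1], "b": [0.2, 0.8]}
assoc = {"a": {"a": 1.0, "b": 0.5}}
scores = modified_probability(p_cat_given_kw, assoc, "a")
```

In this example the values over the categories add up to more than 1,
because the weights $P(K_r;K_i)$ themselves sum to more than 1. The result
is therefore best read as a score used for ranking the categories.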

Next, assume that a given document is indexed by $M$ keywords,
indicated by $\{K_i\}_1^M$. Maron derived the formula of an attribute
number as follows:

\[ P(C_k;\{K_i\}_1^M) \propto P(C_k) \prod_{i=1}^{M} P(K_i;C_k) \tag{7.3} \]

which is given in formula (4.4) of Chapter IV. There is a statistical
relation between probabilities as follows:

\[ P(K_i;C_k) = \frac{P(C_k;K_i)\,P(K_i)}{P(C_k)} \tag{7.4} \]

Substituting (7.4) into (7.3) gives

\[ P(C_k;\{K_i\}_1^M) \propto P(C_k) \prod_{i=1}^{M} \frac{P(C_k;K_i)\,P(K_i)}{P(C_k)} \tag{7.5} \]


where $\prod_{i=1}^{M} P(K_i)$ is independent of categories and therefore
its computation is unnecessary. The formula (7.5) may therefore be
simplified to the form

\[ P(C_k;\{K_i\}_1^M) \propto P(C_k)^{1-M} \prod_{i=1}^{M} P(C_k;K_i) \tag{7.6} \]

Substituting the formula (7.2) into (7.6) then gives

\[ P(C_k;\{K_i\}_1^M) \propto P(C_k)^{1-M} \prod_{i=1}^{M} \sum_{r=1}^{200} P(C_k;K_r)\,P(K_r;K_i) \tag{7.7} \]

where $k$ varies from 1 to 14.
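
A minimal sketch of the computation in (7.7) is given below. It assumes
that the three probability tables are already available as plain Python
containers; the function name `modified_attribute_numbers` and all
numerical values are hypothetical. Every category is assumed to have a
non-zero prior, since $P(C_k)$ is raised to the power $1-M$.

```python
from typing import Dict, List, Sequence


def modified_attribute_numbers(prior: List[float],
                               p_cat_given_kw: Dict[str, List[float]],
                               assoc: Dict[str, Dict[str, float]],
                               doc_keywords: Sequence[str]) -> List[float]:
    """Evaluate the right-hand side of (7.7) for every category k."""
    m = len(doc_keywords)
    scores = []
    for k, p_ck in enumerate(prior):
        score = p_ck ** (1 - m)
        for k_i in doc_keywords:
            # Inner sum of (7.2): P(C_k;K_r) weighted by P(K_r;K_i).
            score *= sum(p_cat_given_kw[k_r][k] * w
                         for k_r, w in assoc[k_i].items())
        scores.append(score)
    return scores


# Invented tables for two categories and two keywords.
prior = [0.5, 0.5]
p_cat_given_kw = {"a": [0.9, 0.1], "b": [0.2, 0.8]}
assoc = {"a": {"a": 1.0, "b": 0.5}, "b": {"b": 1.0, "a": 0.25}}
modified = modified_attribute_numbers(prior, p_cat_given_kw, assoc, ["a", "b"])

# With self-association only, (7.7) reduces to (7.6).
self_only = {"a": {"a": 1.0}, "b": {"b": 1.0}}
unmodified = modified_attribute_numbers(prior, p_cat_given_kw, self_only, ["a", "b"])
```

The categories would then be ranked by decreasing score, the first and
second ranks corresponding to the two largest values.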

The derived formula (7.7) is the general form of attribute number
modified by keyword association. We call the resulting number a
"modified attribute number". The necessary probabilities $P(C_k)$,
$P(C_k;K_r)$ and $P(K_r;K_i)$ for the computation of the modified attribute
number are defined in the following manner:

\[ P(C_k) = \frac{\text{number of documents belonging to the $k$th category}}{\text{total number of documents}} \]

\[ P(C_k;K_r) = \frac{\text{number of documents with the $r$th keyword belonging to the $k$th category}}{\text{number of documents containing the $r$th keyword}} \]

\[ P(K_r;K_i) = \frac{\text{number of documents containing both the $i$th and $r$th keywords}}{\text{number of documents containing the $i$th keyword}} \]
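
The three definitions above amount to simple counting over the base
documents. The following sketch shows one way this might be done, under
the assumption that each base document is given as a set of keywords
together with a zero-based category index; the name `estimate_tables` and
the small data set are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple

Tables = Tuple[List[float], Dict[str, List[float]], Dict[str, Dict[str, float]]]


def estimate_tables(base: List[Tuple[Set[str], int]], n_categories: int) -> Tables:
    """Estimate P(C_k), P(C_k;K_r) and P(K_r;K_i) by counting base documents."""
    cat_count = Counter(cat for _, cat in base)
    kw_count: Counter = Counter()
    kw_cat_count: Dict[str, List[int]] = defaultdict(lambda: [0] * n_categories)
    pair_count: Counter = Counter()
    for keywords, cat in base:
        for k_i in keywords:
            kw_count[k_i] += 1
            kw_cat_count[k_i][cat] += 1
            for k_r in keywords:
                # Includes the pair (K_i, K_i), so that P(K_i;K_i) = 1.
                pair_count[(k_i, k_r)] += 1
    prior = [cat_count[k] / len(base) for k in range(n_categories)]
    p_cat_given_kw = {kw: [c / kw_count[kw] for c in counts]
                      for kw, counts in kw_cat_count.items()}
    assoc: Dict[str, Dict[str, float]] = defaultdict(dict)
    for (k_i, k_r), n in pair_count.items():
        assoc[k_i][k_r] = n / kw_count[k_i]
    return prior, p_cat_given_kw, dict(assoc)


# Four invented base documents in two categories.
base = [({"a", "b"}, 0), ({"a"}, 0), ({"b"}, 1), ({"b", "c"}, 1)]
prior, p_cat_given_kw, assoc = estimate_tables(base, n_categories=2)
```

Pairs of keywords that never occur together are absent from the
association table, which is equivalent to $P(K_r;K_i) = 0$.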


7.2.2  Experimental  Results. 

Following  the  manner  of  the  previous  experiments,  the  modified 
classification  system  was  tested  on  acoustic  1966,  1967,  1968,  and 
1961  data  separately.  The  figures  derived  from  this  experiment  are 
shown  in  Table  7.4. 

In Table 7.4, the results of Maron's method from Table 7.3 are
repeated in order to clarify the comparison with our modified method.

It  is  clear  that  the  modified  method  has  no  advantage  over  Maron's 
method.  In  fact  it  is  slightly  poorer  than  the  previous  method  whose 
results  are  in  Table  7.3.  This  suggests  that  information  about  word 
associations  is  less  useful  than  information  about  the  words  actually 
present  in  the  documents. 

The above fact is not surprising in view of the extremely careful
way of choosing the keywords. If the keywords had been chosen in a less
careful manner, then some important keywords might have been
omitted, in which case they could influence the choice of category
only through the effect of non-zero values of $P(K_r;K_i)$.

7.2.3  Discussion. 

Comparing the two methods from the results in Table 7.4 for the titles
classified correctly in the first rank, the two methods were very
similar in their determination of categories. But for group 2, the
modified method performed poorly. In group 1, the modified method
produced some improvement in choice of correct classifications for
the titles listed in the second rank.

Table 7.4 Experimental Results of Modified Method and Their Comparison
with Those of Maron's Method

                                                    Modified   Maron's
                                                    Method     Method

Group 1 (1966)
  Number of titles with at least one keyword, N_k   313        313
  Number of correct classifications listed
    in the first rank                               270        270
    % of N_k                                        86.3%      86.3%
  Number of correct classifications listed
    in the second rank                              24         17
  Number of correct classifications listed
    in one of the first two ranks                   294        287
    % of N_k                                        93.9%      91.7%

Group 2 (1967)
  Number of titles with at least one keyword, N_k   283        283
  Number of correct classifications listed
    in the first rank                               175        176
    % of N_k                                        61.8%      62.2%
  Number of correct classifications listed
    in the second rank                              26         34
  Number of correct classifications listed
    in one of the first two ranks                   201        210
    % of N_k                                        71.0%      74.2%

Group 3 (1968)
  Number of titles with at least one keyword, N_k   363        363
  Number of correct classifications listed
    in the first rank                               207        216
    % of N_k                                        57.0%      59.5%
  Number of correct classifications listed
    in the second rank                              53         45
  Number of correct classifications listed
    in one of the first two ranks                   260        261
    % of N_k                                        71.6%      71.9%

Group 4 (1961)
  Number of titles with at least one keyword, N_k   190        190
  Number of correct classifications listed
    in the first rank                               99         108
    % of N_k                                        52.1%      56.8%
  Number of correct classifications listed
    in the second rank                              28         24
  Number of correct classifications listed
    in one of the first two ranks                   127        132
    % of N_k                                        66.8%      69.5%
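
The percentages reported in Table 7.4 follow directly from the counts,
and may be recomputed as a check. The short sketch below does this; the
counts are copied from the table, while the names `TABLE_7_4` and
`rank_percentages` are only illustrative.

```python
from typing import Dict, Tuple

# (N_k, correct in first rank, correct in second rank), from Table 7.4.
TABLE_7_4: Dict[str, Dict[str, Tuple[int, int, int]]] = {
    "Group 1 (1966)": {"modified": (313, 270, 24), "maron": (313, 270, 17)},
    "Group 2 (1967)": {"modified": (283, 175, 26), "maron": (283, 176, 34)},
    "Group 3 (1968)": {"modified": (363, 207, 53), "maron": (363, 216, 45)},
    "Group 4 (1961)": {"modified": (190, 99, 28), "maron": (190, 108, 24)},
}


def rank_percentages(n_titles: int, first: int, second: int) -> Tuple[float, float]:
    """Return (% correct in first rank, % correct in first two ranks)."""
    first_pct = round(100.0 * first / n_titles, 1)
    top_two_pct = round(100.0 * (first + second) / n_titles, 1)
    return first_pct, top_two_pct


summary = {
    group: {method: rank_percentages(*counts) for method, counts in methods.items()}
    for group, methods in TABLE_7_4.items()
}
```

The recomputed values agree with the table to one decimal place, and the
first-rank and second-rank counts add up to the counts listed for the
first two ranks in every group.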

The above facts may imply the following conclusions. For the
titles which Maron's method classified correctly in the first rank,
keywords of these titles are strongly associated with their correct
categories. Therefore the values of $P(K_r;K_i)$ used in the modified
method hardly affect the choice of correct categories for such titles.
On the other hand the titles classified correctly in the second rank
may be regarded as having relatively weak associations with their
correct categories, in which case the attribute numbers
$P(C_k;K_r)P(K_r;K_i)$ of such keywords that are strongly associated with
keywords $K_i$ in a title give rise to the better classification.

From Appendix B, which indicates the similarity coefficients
between three groups of data, it may be seen that the similarity
coefficient between group 1 and group 2 has the lowest value. This
implies that the behavior of the keyword distributions of group 1
differs somewhat from that of group 2. This may cause a decrease in the
number of correct classifications for the second rank in group 2.

From  the  results  shown  in  Table  7.4  it  appears  that  the  attribute 
numbers  provide  a  useful  means  to  classify  documents  into  categories, 
and  that  a  great  deal  of  improvement  cannot  be  expected  by  the  use  of 
the  modified  method. 

7.2.4  Suggestion. 

In the modified method it is assumed that every keyword originally
appearing in a document is equally significant. This assumption may not
be realistic. When we analyze a document, first we look for the
important sentences and words, and next we rank them in order of
significance.

A further suggestion is to generate a classification system that
involves the concept that every keyword in a document has a significance
factor, or weight. Suppose that a request document contains $M$ keywords,
indicated by $\{K_i\}_1^M$, including additional keywords, and that by some
method all weights of these keywords, indicated by $\{w_i\}_1^M$, are
determined and that the weight $w_i$ is independent of categories.
$P(C_k;\{w_iK_i\}_1^M)$ defines the probability that the request document
indexed by $\{K_i\}_1^M$ with weights $\{w_i\}_1^M$ belongs to category $C_k$.

Let us make an assumption that the weight of a keyword is also
the weight of the probability that a document in a category $C_k$ may be
indexed by this keyword. This can be formulated as

\[ P(w_iK_i;C_k) = w_i\,P(K_i;C_k) \tag{7.8} \]

For the approximation of the value $P(C_k;\{w_iK_i\}_1^M)$ two possible
approaches may be used. First, the value may be set as the sum of the
attribute numbers between each keyword $K_i$ and category $C_k$ multiplied
by the weight $w_i$; thus:

\[ P(C_k;\{w_iK_i\}_1^M) \approx \sum_{i=1}^{M} w_i\,P(C_k;K_i) \tag{7.9} \]


The form of the right-hand side of (7.9) ensures that if a document
contains many keywords of high weights strongly associated with a certain
category, then the attribute number of the document becomes large.
Conversely, if a keyword is strongly associated with a category but its
weight in the document is small, then the corresponding term on the
right-hand side of (7.9) cannot be considered to be important.
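
The behaviour described for (7.9) can be seen in a small sketch, assuming
that the attribute number between a keyword and a category is taken as
$P(C_k;K_i)$. The function name `weighted_sum_scores`, the weights and the
probabilities are invented for illustration; this method was not tested
in the present work.

```python
from typing import Dict, List


def weighted_sum_scores(p_cat_given_kw: Dict[str, List[float]],
                        weights: Dict[str, float]) -> List[float]:
    """For each category k, sum w_i * P(C_k;K_i) over the document keywords."""
    n_categories = len(next(iter(p_cat_given_kw.values())))
    return [sum(w * p_cat_given_kw[kw][k] for kw, w in weights.items())
            for k in range(n_categories)]


# Keyword "a" points to category 0, keyword "b" to category 1.
p_cat_given_kw = {"a": [0.9, 0.1], "b": [0.2, 0.8]}

# The same document keywords under two different (invented) weightings.
favour_b = weighted_sum_scores(p_cat_given_kw, {"a": 0.2, "b": 1.0})
favour_a = weighted_sum_scores(p_cat_given_kw, {"a": 1.0, "b": 0.2})
```

With the first weighting the second category obtains the larger score,
and with the second weighting the first category does, although the
keywords themselves are unchanged.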

Another approach is based on the assumption that in each category
the keywords occur statistically independently. The attribute number
$P(C_k;\{w_iK_i\}_1^M)$ may then be modified to become

\[ P(C_k;\{w_iK_i\}_1^M) = \frac{P(\{w_iK_i\}_1^M;C_k)\,P(C_k)}{P(\{w_iK_i\}_1^M)} \tag{7.10} \]

where the denominator $P(\{w_iK_i\}_1^M)$ is independent of categories and
hence may be eliminated. By the assumption of keyword independence, the
formula (7.10) is simplified as follows:

\[ P(C_k;\{w_iK_i\}_1^M) \propto P(C_k) \prod_{i=1}^{M} P(w_iK_i;C_k) \tag{7.11} \]

Substitution of the relations (7.8) into (7.11) leads to the formula

\[ P(C_k;\{w_iK_i\}_1^M) \propto P(C_k) \prod_{i=1}^{M} w_i\,P(K_i;C_k) \tag{7.12} \]
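
Formula (7.12) may be sketched in the same way. The function name
`weighted_product_scores` and the numerical values below are hypothetical,
and the weights are taken to be independent of categories, as assumed in
the text.

```python
from typing import Dict, List


def weighted_product_scores(prior: List[float],
                            p_kw_given_cat: Dict[str, List[float]],
                            weights: Dict[str, float]) -> List[float]:
    """Evaluate P(C_k) times the product of w_i * P(K_i;C_k), as in (7.12)."""
    scores = []
    for k, p_ck in enumerate(prior):
        score = p_ck
        for kw, w in weights.items():
            score *= w * p_kw_given_cat[kw][k]
        scores.append(score)
    return scores


# Invented tables: p_kw_given_cat[K_i][k] holds P(K_i;C_k).
prior = [0.6, 0.4]
p_kw_given_cat = {"a": [0.5, 0.1], "b": [0.2, 0.4]}

weighted = weighted_product_scores(prior, p_kw_given_cat, {"a": 3.0, "b": 0.5})
unweighted = weighted_product_scores(prior, p_kw_given_cat, {"a": 1.0, "b": 1.0})
ratios = [w / u for w, u in zip(weighted, unweighted)]
```

It may be noted from this sketch that, when the weights do not depend on
the category, their product is a common factor of every score (here 1.5),
so that the ranking of categories under (7.12) is the same as under the
unweighted formula (7.3). Any effect of the weights on the ranking would
therefore have to come from the sum form (7.9) or from the additional
keywords that the weighting procedure introduces.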


The  method  proposed  by  Stiles  in  1961  (20)  appears  suitable  for 
estimation  of  the  value  of  the  weights.  It  not  only  computes  the  weights 
of  keywords,  but  also  produces  the  additional  keywords  to  extend  a 
request  document  as  well.  The  precise  procedure  is  shown  in  Appendix 
C. 

We have not tested the classification methods suggested in the
formulae (7.9) and (7.12).


CHAPTER VIII

CONCLUSIONS 

The  present  thesis  has  studied  automatic  classification  systems 
based  on  statistical  relationships  between  words  and  subject  categories 
of  documents.  Several  experimental  trials  have  been  described. 

Using  the  IBM  360/67  computer  installed  at  The  University  of 
Alberta  Computing  Center,  experiments  were  designed  using  titles  and 
authors  published  by  JASA  (the  Journal  of  the  Acoustical  Society  of 
America)  in  1966,  in  1967,  in  1968,  and  in  1961. 

Chapter II and Chapter III contain an examination of the
applicability of latent class analysis to document classification. The
latent class analysis is critically dependent on the assumption that, in
each latent class, keywords occur statistically independently. This
assumption is expressed through the form of the accounting equations
(2.5), (2.6) and (2.7) in Chapter II. From the analysis of word
occurrences latent class analysis can provide a set of classification
categories as well as probabilities between words and categories. The
examination of latent class analysis provided clear illustration of its
unsuitability for document classification systems. Determination of the
number of latent classes is a very difficult problem. The hypothesis of
Winters that the number of latent classes is equal to the number of
keywords facilitates the numerical solution of the accounting equations,
but our attempt to apply Winters' technique was unsuccessful. Moreover,
it is doubtful whether the latent classes derived from the latent class
analysis are meaningful. It must be concluded that the strictly
mathematical latent class analysis is not a useful tool with which to
attack the problem of document classification.

Essentially, the same assumptions are used to justify Maron's
attribute analysis. Both analyses assume the statistical independence
of keyword occurrence in each subject category or latent class. The
difference, however, exists in the procedure used to determine the
classification schedule. Attribute analysis requires a set of base
data which are classified correctly according to a pre-existing
classification schedule. In contrast the latent class analysis uses
neither pre-existing base data nor a classification schedule. In
Chapter IV it was shown that use of attribute numbers for documents
keyworded by more than one word assigns a correct category very
successfully. It was concluded that attribute analysis forms a promising
method for document classification.

One of the methods proposed in the present thesis is a maximization
method of correct classification as described in Section 7.1 of
Chapter VII. For the documents keyworded by a single word or a pair of
words the maximization method uses direct statistical descriptions of
the base data instead of approximations as calculated in Maron's method
based on the independence assumption. In comparison with Maron's method,
the experimental results appear to be slightly improved. The results
suggest that Maron's approximation that keywords occur statistically
independently in each subject category holds meaningfully among
documents whose keywords are chosen from natural language.

A modification of Maron's method in terms of keyword associations
was proposed in an attempt to improve the classification of documents
that contain relatively few keywords. However, the method did not lead
to improved classification. It appears that the keywords that are
themselves contained in a document are better clues for assignment of
correct categories than are extended keywords derived from keyword
associations.

The above fact may imply that the scheme used to select 200
keywords for the present experiments helps a great deal to ensure correct
classifications. If the keywords are selected less carefully, however,
the attribute numbers corresponding to extended keywords,
$P(C_k;K_r)P(K_r;K_i)$, in the modified method may be necessary in order to
extend a request document and determine its correct category.

Throughout the experiments it was found that the proposed method
of choice of keywords is very suitable for the classification of
document titles. Titles used in the present experiments contain an
average of 8 or 9 words and 1 or 2 keywords. Therefore, the use of direct
statistics on occurrences of more than two keywords was impossible.
However, the direct application of this method to the classification of
abstracts or full text may involve some problems relating to the memory
size and execution time of the computer.

One of the conclusions of the present investigation is that
document titles may provide a very useful source of keywords for
classification. For the acoustics data base the classification
effectiveness does not significantly change with respect to time except
for an initial reduction when the data is extended beyond the base
documents.


APPENDIX A

List  of  Keywords  Chosen  as  Described  in  Chapter  V 


 1  ABSEN    21  BOER     41  CLICK    61  DUE      81  FLORI
 2  AMBIE    22  BOOMS    42  COCHL    62  DURAT    82  FLOW
 3  APPRO    23  BOOM     43  CONIC    63  DURLA    83  FREE
 4  ARASE    24  BOTTO    44  CONSO    64  EAR      84  F2
 5  ARCTI    25  BROWN    45  CRYST    65  EARCA    85  GELLE
 6  ATTAC    26  BULLF    46  CYLIN    66  EARPH    86  GEOME
 7  AUDIO    27  BURKE    47  DALLO    67  EC       87  GOLD
 8  AUDIT    28  BURST    48  DAMPE    68  ELEME    88  GOODM
 9  AXIAL    29  CABLE    49  DAMPI    69  ELFNE    89  GOULD
10  AXISY    30  CALIB    50  DATA     70  ENGLI    90  GUINE
11  BATCH    31  CALLA    51  DEATH    71  ERRAT    91  HAVIN
12  BAUER    32  CAMPB    52  DECIS    72  EXAMI    92  HEARI
13  BEAMS    33  CANTI    53  DEEP     73  EXIST    93  HECKE
14  BEATT    34  CARHA    54  DENHA    74  EXPLO    94  HENNI
15  BENZE    35  CAROM    55  DICHO    75  EXPOS    95  HODGE


APPENDIX B


Similarity  of  Successive  Years  of  Data  Base 


It  has  been  supposed  that  the  statistics  of  the  base  data  are 
similar  to  those  of  the  data  to  be  classified. 

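Suppose that a group of data is represented in vector form where
each element of a vector indicates the frequency of occurrence of the
corresponding word in the data, and that there are two groups of data
such as:

\[ T = (t_1, t_2, \ldots, t_N), \qquad U = (u_1, u_2, \ldots, u_N) \tag{B-1} \]

where $N$ is the total number of different words in the two groups.
The cosine coefficient is computed as follows:

\[ \frac{\sum_{i=1}^{N} t_i u_i}{\sqrt{\sum_{i=1}^{N} t_i^2}\,\sqrt{\sum_{i=1}^{N} u_i^2}} = \text{cosine coefficient} \tag{B-2} \]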


In $N$-dimensional space, if the angle between $T$ and $U$ is 0° then
the cosine coefficient is 1, which means that the two groups of data
are identical except for an amplitude factor. If the angle is 90°,
then the cosine coefficient is 0, which means that there are no common
keywords that index the two groups of data.
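
The two limiting cases can be checked numerically. The sketch below
assumes that each group of data is summarized as a table of word
frequencies; the function name `cosine_coefficient` and the sample
frequencies are illustrative and are not drawn from the acoustic data.

```python
import math
from collections import Counter


def cosine_coefficient(t: Counter, u: Counter) -> float:
    """Cosine of the angle between two word-frequency vectors."""
    words = set(t) | set(u)
    # A Counter returns 0 for a word that does not occur in the group.
    dot = sum(t[w] * u[w] for w in words)
    norm_t = math.sqrt(sum(v * v for v in t.values()))
    norm_u = math.sqrt(sum(v * v for v in u.values()))
    if norm_t == 0 or norm_u == 0:
        raise ValueError("both groups must contain at least one word")
    return dot / (norm_t * norm_u)


# Same distribution up to an amplitude factor of 2.
same_shape = cosine_coefficient(Counter({"sound": 2, "wave": 1}),
                                Counter({"sound": 4, "wave": 2}))

# No words in common.
disjoint = cosine_coefficient(Counter({"sound": 3}), Counter({"speech": 5}))
```

The first pair gives a coefficient of 1 up to rounding error, and the
second gives 0, in agreement with the interpretation above.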

The formula (B-2) has been used to compute the similarity between
the acoustic 1966, 1967, and 1968 titles. The results are shown in
matrix form as follows:

            (1966)     (1967)     (1968)
(1966)      1.0        0.85709    0.86216
(1967)      0.85709    1.0        0.87804        (B-3)
(1968)      0.86216    0.87804    1.0


Among the coefficients in (B-3) the one between acoustic 1967
and acoustic 1968 titles has the highest value; the next highest one
is between acoustic 1966 and acoustic 1968 titles. Therefore, the
choice of acoustic 1968 as the base data is the best among the three
groups. However, the correlation is not significantly different for any
two of the three years.


APPENDIX C

Stiles' Measure of Association Factor and Choice of Keywords

In his paper (20) Stiles introduced a formula to measure the degree
of association between two keywords and named it an "association factor"
defined as follows:

\[ \log_{10} \frac{\left( |N_{ab}N - N_aN_b| - \frac{N}{2} \right)^2 N}{N_aN_b(N - N_a)(N - N_b)} = \text{association factor} \tag{C-1} \]

where $N$ is the total number of documents; $N_a$ is the number of documents
indexed by word A; $N_b$ is the number of documents indexed by word B; and
$N_{ab}$ is the number of documents indexed by both A and B. If $N_{ab}N$ is
less than $N_aN_b$, the association factor must be considered negative. In
the computation of the association factor between word A and itself,
$N_{aa} = N_a$.
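
A minimal numerical sketch of (C-1) follows. The function name
`association_factor` and the document counts are hypothetical. The
instruction that the factor "must be considered negative" is implemented
here by returning the negative of its magnitude, which is only one
possible reading of that rule.

```python
import math


def association_factor(n: int, n_a: int, n_b: int, n_ab: int) -> float:
    """Association factor of (C-1) for words A and B."""
    numerator = (abs(n_ab * n - n_a * n_b) - n / 2) ** 2 * n
    denominator = n_a * n_b * (n - n_a) * (n - n_b)
    if numerator == 0 or denominator == 0:
        raise ValueError("association factor is undefined for these counts")
    factor = math.log10(numerator / denominator)
    if n_ab * n < n_a * n_b:
        # Assumed sign convention for the "considered negative" case.
        factor = -abs(factor)
    return factor


# Invented counts for a collection of 1000 documents.
pair = association_factor(n=1000, n_a=50, n_b=40, n_ab=20)
itself = association_factor(n=1000, n_a=50, n_b=50, n_ab=50)  # N_aa = N_a
never_together = association_factor(n=1000, n_a=50, n_b=40, n_ab=0)
```

In this example the two words that co-occur in 20 documents obtain a
factor above the threshold of 1.0 used in the profile construction, a
word paired with itself obtains a larger factor still, and two words that
never co-occur obtain a negative factor.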

The following steps describe how to generate the additional
keywords to extend a request document and how to determine the weights
of the keywords.

1. For each keyword on a request document, form a "profile" consisting
of all keywords which, in association with the given one, have
association factors greater than 1.0.

2. From the profiles of all keywords on the document, select additional
keywords that appear frequently in the set of profiles. They
are called the first generation keywords.

3. For the original keywords and the first generation keywords,
repeat step 1 and step 2 and select second generation keywords.

4. For each keyword, including the original keywords, the first
generation keywords and the second generation keywords, compute the
sum of the association factors on its profile. The sum is called the
weight of the keyword. This weight is a measurement of the degree of
association between the keyword and the request document.


APPENDIX  D 


First  and  Second  Rank  Classification 

Using  Modified  Attribute  Analysis 


pp.  89  -  97. 


(The notation of 14 categories A, B, ..., N
used on the following list corresponds to
that of the categories 1, 2, ..., 14 used
in section 5.2, respectively. Keywords are
underlined.)